I have a table with approximately 2.5 million rows that I am thinking about moving into a much larger table, 35 million rows, with a boolean flag set on the original 2.5 million.
If I wanted to run lots of queries against the 2.5 million records in the new larger table, would adding an index be useful, i.e. avoid a full table scan on every query? I know that traditionally indexes aren't helpful on boolean columns, but since only 7% of the records will be true, I thought it might not require a table scan on every query.
Perhaps look at using a partial index.
From the docs:

A partial index is an index built over a subset of a table; the subset is defined by a conditional expression (called the predicate of the partial index). The index contains entries for only those table rows that satisfy the predicate.

A major motivation for partial indexes is to avoid indexing common values. Since a query searching for a common value (one that accounts for more than a few percent of all the table rows) will not use the index anyway, there is no point in keeping those rows in the index at all. This reduces the size of the index, which will speed up queries that do use the index. It will also speed up many table update operations because the index does not need to be updated in all cases. Example 11-1 shows a possible application of this idea.
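For instance, a partial index covering only the flagged rows might look something like this (a minimal sketch; big_table, from_small and created_at are placeholder names, not from the question):

-- index only the ~7% of rows that came from the small table
CREATE INDEX big_table_from_small_idx ON big_table (created_at) WHERE from_small;

Queries that include WHERE from_small (and filter or sort on created_at) can use this much smaller index, while writes to the other 93% of the rows never have to touch it.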
I would be looking at partitioning if you have a substantial proportion of the table that you want to access efficiently.
If you do "insert into big select * from small", then all of the rows that came from the small table are likely to be physically close to each other. After analyzing the table, PostgreSQL will know this, and so will probably choose to use the index on the boolean.
But if there is a lot of churn in the rows, then eventually the "true" rows and the "false" rows will become all jumbled up, making use of the index less and less effective, and PostgreSQL will stop using it.
By using partitioning/inheritance, you can keep the rows physically separate (to make sequential scanning on just the small set faster) while making them look like a single set of data when you want to.
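A minimal sketch of what that could look like with declarative partitioning (PostgreSQL 10+; the table and column names are made up for illustration):

CREATE TABLE big_table (
    id         bigint,
    payload    text,
    from_small boolean NOT NULL
) PARTITION BY LIST (from_small);

-- the flagged rows stay physically together in their own partition
CREATE TABLE big_table_flagged PARTITION OF big_table FOR VALUES IN (true);
CREATE TABLE big_table_rest    PARTITION OF big_table FOR VALUES IN (false);

Queries against big_table that filter on from_small will only ever scan the relevant partition.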
Depending on the nature of the queries you run, you might also benefit from adding other columns to the index, keeping the boolean column as the first column.
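For example (again with placeholder names), something along these lines:

CREATE INDEX big_table_flag_created_idx ON big_table (from_small, created_at);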
Related
I'm studying some SQL, and regarding queries it says:
Create an index when the table is large and most queries are expected
to retrieve less than 2% to 4% of the rows in the table.
I just want to get a mental picture of what is meant by the statement. I understand that index is to make your query go much faster. Is it because the index will be focused on only that 2% to 4% of the table?
Databases store data in pages, data pages. One way to make queries more efficient is by reducing the number of data pages that need to be read.
A typical database page might be 8k - 64k in size. If the records are really big, there might be just one record per page. If the records are quite small, there might be hundreds or even thousands on each page.
When you have an index and a condition on the column in the where clause, this restricts the number of rows. The proportion is called the "selectivity" of the where clause.
The SQL engine has two ways of satisfying such a where clause. It can read every row and compare values in each row to the condition. This is called a "full table scan". It can just look up the appropriate values in an index. This is called an "index scan".
Now, when using an index for a where clause, what we want to do is to reduce the number of data pages being read. This happens when we are reading, on average, less than one record per page. This is where the 2% - 4% comes from. Do note that if you have very large records, the number could be much larger.
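As a rough back-of-envelope illustration (the page and record sizes here are made-up numbers, not taken from the question):

rows per page    ≈ 8192 bytes / 100 bytes per record ≈ 80
break-even point ≈ 1 / 80 ≈ 1.25% of the rows

With larger records the threshold rises: at roughly 4 KB per record there are only two rows per page, so an index can pay off even when a much larger fraction of the rows is selected.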
However, there is a problem with this heuristic. Indexes are used for other purposes:
An index can be used to retrieve records in order, if the index matches the order by clause (and other conditions in the query are true).
An index can be used for joining records.
An index can be used to satisfy a query in its entirety, if the columns in the index are the only ones in the query (in this case, one says that the index "covers" the query).
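For instance, a covering index might look like this (a minimal sketch; orders, customer_id and order_date are hypothetical names):

CREATE INDEX orders_cust_date_idx ON orders (customer_id, order_date);

-- this can be answered from the index alone, without touching the table
SELECT customer_id, order_date FROM orders WHERE customer_id = 42;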
So, the information you read is a heuristic. It is a useful guideline, but it is definitely not set in stone.
No, if your index is on the columns in your where condition, you will not have to do a table scan. Not doing a table scan is beneficial when a smaller portion of the rows is being returned.
If you are returning 100% of the rows, there is no major difference between a table scan or an index scan.
I was asked a question today on when wouldn't I want to create a SQL Index on a table.
The only thing I can think of is when you don't need one (i.e. a small table). That answer doesn't feel right. Is there a threshold for when I should use an index and when I shouldn't?
As for when not to create an index on a table, there are lots of things to consider.
First, there are a lot of possible indexes you could create. For example, you could create an index containing not only every column in the table, but every permutation of the columns (since column ordering in indexes does matter). This can become a huge number of indexes as your column count gets higher, as the quick calculation below shows.
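With n columns there are n!/(n-k)! possible k-column indexes, so even a modest table balloons (the five-column figure here is just an illustration):

5 columns → 5 + 20 + 60 + 120 + 120 = 325 possible single- and multi-column indexes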
Every index comes with a number of things that decrease performance in different ways. For example, indexes take memory/disk space away from what is available. Probably worse than that, though, is the fact that indexes need to be updated when the table underneath them is updated. This means that every insert/update/delete in a table can trigger an index update. The more indexes you have, the more indexes there are to update, which can kill performance on your CUD operations, and can kill your server performance if you are doing these often.
Because of this performance impact, you want to avoid 'useless' indexes. Indexes that are used for every query are typically good, but an index used only once a day for a <1s query is probably useless. It's all a tradeoff in trying to determine which indexes are useful enough that their performance benefits outweigh their performance costs.
You could answer it with the counter-question: when do you need an index?
You need an index if you want to search for entries and get your results faster, for example if the column is used in a where clause. Of course you could try to index everything, but indexing costs extra memory/disk space. So you only index columns you use to find your rows.
You can analyze which rows MySQL (for example) reads while trying to find your rows with the EXPLAIN command.
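For example (a minimal sketch; the table and column are hypothetical):

EXPLAIN SELECT * FROM users WHERE email = 'alice@example.com';

The output shows whether MySQL can use an index (the "key" column) or has to scan the whole table (a "type" of ALL).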
Does this help?
A rule of thumb is to drop all indices except the unique index on the primary key on small tables (fewer than about 100,000 rows).
Also, it is not appropriate to use an index if the column is not used for searching (e.g. the salary of employees).
I have created a script to find the selectivity of each column for every table. Some of those tables have fewer than 100 rows, but the selectivity of a column is more than 50%,
where Selectivity = Distinct Values / Total Number of Rows.
So, are those columns eligible for an index?
Or, can you tell me the minimum number of rows required before creating an index is worthwhile?
I think I understand what you are trying to accomplish by calculating a 'Selectivity' value for your data, but you cannot apply the rule blindly.
In fact, for certain queries the 'Selectivity' value might be really low and an index will still be very beneficial. For example:
Assume an 'inbox' table with millions of rows, where the rows have a 'Read' boolean field. In this case the number of distinct values over the number of rows will be really low. If most items are read most of the time, then finding unread items with an index on this field will be very efficient.
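In SQL Server terms that would be a filtered index, something like this (a sketch with made-up names):

CREATE NONCLUSTERED INDEX IX_inbox_unread ON inbox (UserId) WHERE [Read] = 0;

Only the unread rows are kept in the index, so "find my unread items" queries stay fast even as the table grows.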
Creating indexes comes at a cost. Although you get the benefit for reads, you pay for it on writes and in disk usage.
I would rather recommend you profile your queries and index accordingly. You can also look at the data from sys.dm_db_missing_index_group_stats and other dynamic management views that will give you insight into index usage (or missing indexes).
You can create an index on a table with 0 rows, 1 row or 100 million rows. You can create an index where every column has the same value or unique values.
So you can create an index. The question is really whether you should create an index, and no tool is going to tell you that, because indexes can also span multiple columns and it depends on what queries you run. Creating indexes is something done when performance tuning queries, or preemptively when you know you'll be writing queries that will use them.
Every index comes with a cost in terms of space and the time required to do updates, inserts and deletes. You don't want to be creating them spuriously, so you're really going to have to do this by hand, not as the result of a script that checks how unique the values of a column are.
A general rule of thumb says that if you have a very large table (over 1 million rows), you should only use an index if a WHERE clause based on that index selects at most something in the neighborhood of 1-2% of the data.
If you have a "gender" column and roughly 50% of values are "male" and roughly 50% "female", then having an index on that really doesn't give you much - SQL Server and most other RDBMS will most likely still do a full table scan in this case, since on average, they'd have to scan at least half the table anyway, so the "detour" by using an index first and then looking up the actual full data based on that index value is just not worth it.
An index is excellent if you have something like unique keys (customer number), or a value that is quite selective. An index is not without cost - it uses up disk space, it needs to be maintained, it will slightly slow down all operations besides the SELECT - so tread carefully, it's not the best idea to just blindly index everything. Having too few indices is bad - but having too many, and the wrong ones, can be even worse! :-) Nobody ever claimed getting your indices right was easy.... :-)
But there's definitely help out there - the best source I know are Kimberly Tripp's excellent blog posts on SQL Server indexing (and many other topics).
Marc
I'm updating tables with millions of records and I need to be as efficient as possible. Is there a point at which adding more criteria to the where clause will actually hurt rather than help?
For example, if I know I want to set a column to 3, I could use this query:
update mytable set col = 3
Or I could update the record only if it's different
update mytable set col = 3 where col <> 3
I could also filter it so it only updates records added since the last time I ran this process
update mytable set col = 3 where col <> 3 and createDate > #lastRunDate
And perhaps I could look for more things in additional columns.
I guess my question is if there is a point where the cost of looking at additional columns outweighs the cost of the update itself and if there's a principle you can use to determine where to draw the line.
Update
So here's the principle I'm trying to piece together based on what was said. Feel free to argue with this and I'll update it accordingly:
If there are no indexed columns to filter on, add as many criteria as possible to limit the records being updated, since a full table scan is going to happen anyway.
If the difference in records between filtering on only indexed columns and filtering on all possible columns is marginal, only use the indexed columns and avoid the full table scan.
If you have a mix of indexed and non-indexed columns, definitely use the indexed columns if you can and only use non-index columns if... [[I'm still struggling with this part. What's the threshold for introducing the non-indexed columns in the where clause?]]
Update #2
Sounds like I have my answer.
If you have an index on "col", then running your first query will update millions of rows regardless; your second query would potentially update only a few, and find those quickly if there's an index available. If you don't have an index on that column, the effect will be marginal, since a full table or index scan must occur to check all rows in your table (you'll just have fewer actual updates, but that's it).
The whole point of restricting your queries using WHERE clauses is to reduce the scope of your query, e.g. the number of rows SQL Server has to look at. Less data to process is always faster than processing all the millions of rows.
In response to your update: the main goal of using a WHERE clause is to reduce the number of rows you need to inspect / touch. If you have a means (typically an index) to reduce that number from 100% to a few percent, then it's definitely worth it. That's the whole point of having indices (mostly for SELECTs, but applies to other operations, too, of course).
If you have a suitable index, and thus you can pluck out a few hundred rows to check against a criteria versus having to inspect millions of rows, you'll always be faster. If you have a good book index in a bookstore that guides you easily to the two shelves where the books that interest you are located, you'll find what you're looking for more quickly than when you have to criss-cross the whole bookstore since there's no index available.
There obviously is a point where yet another criterion or index doesn't help anymore. If that's the case, typically yet another WHERE clause won't really help much - or at all. But in that case, the SQL query optimizer will find those cases and filter them out (possibly even just ignoring them when deciding on the best query execution plan).
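For illustration, the third query could be supported by something like this (assuming SQL Server; mytable and createDate come from the question, the index name is made up):

CREATE INDEX IX_mytable_createDate ON mytable (createDate);

An index on col itself would make the second query selective too, but remember that the UPDATE changes col, so any index containing col has to be maintained on every one of those writes.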
This really comes down to index usage and query optimization. I would suggest looking at the query plan before making any decisions.
Adding indexed fields to the where clause will often improve query time; however, adding non-indexed fields can result in table scans, which will slow your query.
My suggestion is to write a query that works, look at the execution time, and work to reduce it to an acceptable level by looking at the query plan. Don't over-optimize; go for the acceptable solution.
Say I have a table with a large number of rows and one of the columns which I want to index can have one of 20 values.
If I were to put an index on the column would it be large?
If so, why? If I were to partition the data into 20 tables, one for each value of the column, the index size would be trivial but the indexing effect would be the same.
It's not the indexes that will suck. It's putting indexes on the wrong columns that will suck.
Seriously though, why would you need a table with a single column? What would the meaning of that data be? What purpose would it serve?
And 20 tables? I suggest you read up on database design first, or otherwise explain to us the context of your question.
Indexes (or indices) don't suck. A lot of very smart people have spent a truly remarkable amount of time over the last several decades ensuring that this is so.
Your schema, however, lacking the same amount of expertise and effort, may suck very badly indeed.
Partitioning, in the case described, is equivalent to applying a clustered index. If the table is sorted otherwise (or is in arbitrary order) then the index necessarily has to occupy much more space. Depending on the platform, a non-clustered index may reduce in size as the sortedness of the rows with respect to the indexed value increases.
YMMV.
The short answer:
Do indexes suck: Yes and No
The longer answer:
They don't suck if used properly. Maybe you should start reading about how indexes work, why they can work and why they sometimes don't work.
Good starting points:
http://www.sqlservercentral.com/articles/Indexing/
No, indexes don't suck, but you have to pay attention to how you use them or they can backfire on the performance of your queries.
First: Schema / design
Why would you create a table with only one column? That's probably taking normalization one step too far. Database design is one of the most important things to consider in optimizing performance.
Second: Indexes
In a nutshell, an index helps the database perform a binary search for your record. Without an index on a column (or set of columns) the database will often fall back to a table scan. A table scan is very expensive because it involves enumerating each and every record.
It doesn't really matter THAT much for index scans how many records there are in the database table. Because of the (balanced) binary tree search, doubling the number of records will only result in one extra search step.
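Roughly speaking (a back-of-envelope illustration, assuming a balanced binary tree):

1,000,000 rows → about log2(1,000,000) ≈ 20 comparison steps
2,000,000 rows → about log2(2,000,000) ≈ 21 comparison steps

Real B-tree indexes fan out much more widely than 2, so the number of page reads is even smaller, but the principle is the same: doubling the data adds roughly one step.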
Determine the primary key of your table; SQL Server will automatically place a clustered index on that column (or columns). Clustered indexes perform really well. In addition you can place non-clustered indexes on columns that are used often in SELECT, JOIN, WHERE, GROUP BY and ORDER BY statements. Do remember that indexes have a certain overlap; try never to include your clustered index columns in a non-clustered index.
Also interesting might be the fill factor of the indexes. Do you want to optimize your table for reads (high fill factor - less storage, less IO) or for writes (low fill factor - more storage, less rebuilding of your database pages)?
Third: Partitioning
One of the reasons to use partitioning is to optimize your data access. Let's say you have 1 million records of which 500,000 records are no longer relevant but stored for archiving purposes. In this case you could decide to partition the table and store the 500,000 old records on slow storage and the other 500,000 records on fast storage.
To measure is to know
The best way to get insight into what happens is to measure what happens to your CPU and IO. Microsoft SQL Server has tools like the Profiler and execution plans in Management Studio that will tell you the duration of your query, the number of reads/writes and the CPU usage. The execution plan will also tell you which indexes are being used, or IF they are being used at all. To your surprise you might see a table scan even though you didn't expect it.
Say I have a table with a large number of rows and one column which I want to index can have one of 20 values. If I were to put an index on the column would it be large?
The index size will be proportional to the number of your rows and the length of the indexed values.
The index keeps not only the indexed value, but also some kind of pointer to the row (ROWID in Oracle, ctid in PostgreSQL, the primary key in InnoDB, etc.).
If you have 10,000 rows and one distinct value, you will still have 10,000 entries in your index.
If so, why? If I were to partition the data into 20 tables, one for each value of the column, the index size would be trivial but the indexing effect would be the same
In this case, you would end up with 20 indexes that are, in total, about the same size as your original one.
This technique is in fact sometimes used, in what are called partitioned indexes. It has its advantages and drawbacks.
Standard b-tree indexes are best suited to fairly selective columns, which this example would not be. You don't say what DBMS you are using; Oracle has another type of index called a bitmap index which is better suited to low-selectivity columns in OLAP environments (these indexes are expensive to maintain, making them unsuitable for OLTP environments).
The optimiser will decide, based on stats, whether it thinks the index will help get the data in the fastest time; if it won't, the optimiser won't use it.
Partitioning is another strategy. In Oracle you can define a table as partitioned on some set of columns, and the optimiser can automatically perform "partition elimination" as you suggest.
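For example, in Oracle (a sketch with hypothetical table and column names):

CREATE BITMAP INDEX orders_status_bix ON orders (status);

A bitmap index stores one bitmap per distinct value, so a column with only 20 values stays very compact, at the price of expensive maintenance under concurrent DML.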
Sorry, I'm not quite sure what you mean by "large".
If your index is clustered, all the data for each record will be on the same leaf page, thereby creating the most efficient index available to your table as long as you write your queries against it properly.
If your index is non-clustered, then only the index-related data will be on your leaf pages. Then, depending on such things as how many other indexes you have, coupled with details like your fill factor, your index may or may not be efficient. In general, if you don't have a ton of indexes on your table, you should be safe.
The efficiency of your index will also be determined by the data type of the 20 values going into the column. If those are pre-defined values, then their details should probably be in a lookup table with a simple primary key datatype (like Int/Number). Then add that column to your table as a foreign key, with an index on the column.
Ultimately, you could have a perfect index on a column. But it's best use will be determined for the most part by the queries you write. So if your queries make use of the indexes, you're golden.
Indexes are purely for performance. If an index doesn't boost performance for the queries you're interested in, then it sucks.
As for disk usage, you have to weigh your concerns. Different SQL providers build indexes differently, but as a client, you generally trust that they do the best that can be done. In the case you're describing, a clustered index may be optimal for both size and performance.
It would be large enough to hold those values for all the rows, in a sorted order.
Say you have 20 different strings of 4 characters and 1 million rows; it would take at least 4 million bytes (or 8 million with 16-bit Unicode) just to hold those values.