Index on column with only 2 distinct values - SQL

I am wondering about the performance of this index:
I have an "Invalid" varchar(1) column that has 2 values: NULL or 'Y'
I have an index on (invalid), as well as (invalid, last_validated)
Last_validated is a datetime (this is used for an unrelated SELECT query).
I am flagging a small proportion (1-5%) of the rows in the table as 'to be deleted'.
This is so that when I run
DELETE FROM items WHERE invalid='Y'
it does not perform a full table scan to find the invalid items.
The problem is that the actual DELETE is now quite slow, possibly because all the index entries have to be removed as the rows are deleted.
Would a bitmap index provide better performance for this? Or perhaps no index at all?

The index should be used, but the DELETE can still take some time.
Have a look at the execution plan of the DELETE:
EXPLAIN PLAN FOR
DELETE FROM items WHERE invalid='Y';
SELECT * FROM TABLE( dbms_xplan.display );
You could try using a Bitmap Index, but I doubt that it will have much impact on performance.
Using NULL as a value is not a good idea. The query
SELECT something FROM items WHERE invalid IS NULL
would not be able to use your index, since the index contains only non-NULL values.
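If switching to 'N' is not an option, there are a couple of common Oracle workarounds; this is only a hedged sketch (the index names are illustrative), so verify the plans with EXPLAIN PLAN on your own data:
-- A constant trailing key forces rows where invalid is NULL into the index:
CREATE INDEX ix_items_invalid2 ON items (invalid, 0);
-- Or a function-based index that maps NULL to a searchable value:
CREATE INDEX ix_items_invalid_nvl ON items (NVL(invalid, 'N'));
SELECT something FROM items WHERE NVL(invalid, 'N') = 'N';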

As Peter suggested, it's important to first verify that the index is being used for the DELETE. Bitmap indexes introduce much coarser locking for DML, which could hurt overall performance.
Additional considerations:
are there unindexed foreign key references to this table from other tables? (a rough dictionary query for this is sketched below)
are there triggers on this table that are performing other DML?
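For the first point, a rough data-dictionary sketch along these lines can help (it only handles single-column foreign keys cleanly and assumes ITEMS is the parent table name):
SELECT c.table_name, cc.column_name, c.constraint_name
FROM   user_constraints  c
JOIN   user_cons_columns cc ON cc.constraint_name = c.constraint_name
WHERE  c.constraint_type = 'R'
AND    c.r_constraint_name IN (SELECT constraint_name
                               FROM   user_constraints
                               WHERE  table_name = 'ITEMS'
                               AND    constraint_type IN ('P', 'U'))
AND    NOT EXISTS (SELECT 1
                   FROM   user_ind_columns ic
                   WHERE  ic.table_name      = c.table_name
                   AND    ic.column_name     = cc.column_name
                   AND    ic.column_position = 1);
-- Any rows returned are child columns with no index leading on the FK column.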

Two thoughts on this...
Using NULL to express the opposite of 'Y' is possibly not a good idea. NULL means 'I don't know what this value is' or 'there is no meaningful answer to the question'. You should really use 'N' as the opposite of 'Y'. This would also eliminate the problem of searching for valid items, because Oracle does not store entirely-NULL keys in a B-tree index, so a query for invalid IS NULL cannot use the index on that column.
You may want to consider adding a CHECK CONSTRAINT on such a column to ensure that only legal values are entered.
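A minimal sketch of such a constraint, assuming the column is switched to 'Y'/'N' (the constraint name is illustrative):
ALTER TABLE items
  ADD CONSTRAINT chk_items_invalid CHECK (invalid IN ('Y', 'N'));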
Neither of these changes necessarily has any impact on DELETE performance however.

I recommend:
check how many records you expect the DELETE to affect (i.e. maybe there are more than you expect)
if the number of rows that should be deleted is relatively small, check that the index on invalid is actually being used by the DELETE
get a trace on the session to see what it is doing - it might be reading more blocks from disk than expected, or it might be waiting (e.g. record locking or latch contention)
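One way to get such a trace, assuming Oracle 10g or later for DBMS_MONITOR (the tracefile identifier is just an arbitrary label):
ALTER SESSION SET tracefile_identifier = 'delete_invalid';
EXEC DBMS_MONITOR.SESSION_TRACE_ENABLE(waits => TRUE, binds => FALSE);
DELETE FROM items WHERE invalid='Y';
EXEC DBMS_MONITOR.SESSION_TRACE_DISABLE;
-- Format the resulting trace file with tkprof to see waits and block reads.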
Don't bother dropping or creating indexes until you have an idea of what actually is going on. You could make all kinds of changes, see an improvement (but not know why it improved), then months down the track the problem reoccurs or is even worse.

Drop the index on (invalid) and try both the SELECT and the DELETE. You already have an index on (invalid, last_validated), so you should not need the separate index on invalid alone. Also, approximately how many rows are there in this table?


SQL Server table optimal indexing

I have a very specific question, this is part of a job interview test.
I have this table:
CREATE TABLE Teszt
(
Id INT NOT NULL
, Name NVARCHAR(100)
, [Description] NVARCHAR(MAX)
, Value DECIMAL(20,4)
, IsEnabled BIT
)
And these selects:
SELECT Name
FROM Teszt
WHERE Id = 10
SELECT Id, Value
FROM Teszt
WHERE IsEnabled = 1
SELECT [Description]
FROM Teszt
WHERE Name LIKE '%alma%'
SELECT [Description]
FROM Teszt
WHERE Value > 1000 AND IsEnabled = 1
SELECT Id, Name
FROM Teszt
WHERE IsEnabled = 1
The question is, where on this table should I put indexes to optimize the performance of the above queries. No other info on the table was provided, so my answer will contain the general pro/contra arguments for indexes, but I'm not sure regarding the above queries.
My thoughts on optimizing these specific queries with indexes:
Id should probably have an index; it looks like the primary key and it is part of a WHERE clause;
Creating one on the Value column would also be good, as it is part of a WHERE clause here;
Now it gets murky for me. For the Name column, based on just the above queries, I probably shouldn't create one, as it is only used with a leading-wildcard LIKE, which defeats the purpose of an index, right?
I tried to read everything on indexing a BIT column (the IsEnabled column in the table), but I can't say it's any clearer to me, as the arguments range wildly. Should I create an index on it? Should it be filtered? Should it be a separate index, or part of one with the other columns?
Again, this is all theoretical, so no info on the size or the usage of the table.
Thanks in advance for any answer!
Regards,
Tom
An index on a bit column is generally not recommended. The following discussion applies not only to bit columns but to any low-cardinality value. In English, "low-cardinality" means the column takes on only a handful of values.
The reason is simple. A bit column takes on three values (if you include NULL). That means that a typical selection on the column would return about a third of the rows. A third of the rows means that you would (typically) be accessing every data page. If so, you might as well do a full table scan.
So, let's ask the question explicitly: when is an index on a bit column useful or appropriate?
First, the above argument does not work if you are always looking for IsEnabled = 1 and, say, 0.001% of the rows are enabled. This is a highly selective query and an index could help. Note: The index would not help on IsEnabled = 0 in this scenario.
Second, the above argument argues in favor of a clustered index on the bit value. If the values are clustered, then even a 30% selectivity means that you are only reading 30% of the rows. The downside is that updating the value means moving the record from one data page to another (a somewhat expensive operation).
Third, a bit column can constructively be part of a larger index. This is especially true of a clustered index with the bit first. For instance, for the fourth query, one could argue that a clustered index on (IsEnabled, Value) would be the optimal index (the leaf level of a clustered index already carries Description, which as an NVARCHAR(MAX) could not be an index key column anyway).
To be honest, though, I don't like playing around with clustered indexes. I prefer that the primary key be the clustered index. I admit that performance gains can be impressive for a narrow set of queries -- and if this is your use case, then use them (and accessing enabled rows might be a good reason for using them). However, the clustered index is something that you get to use only once, and primary keys are the best generic option to optimize joins.
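As a concrete sketch of the third point above, in nonclustered form (NVARCHAR(MAX) cannot be an index key column, and the index name is my own choice, not the definitive answer):
CREATE NONCLUSTERED INDEX IX_Teszt_IsEnabled_Value
    ON dbo.Teszt (IsEnabled, Value)
    INCLUDE ([Description]); -- covers the fourth query without touching the base table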
You can read the details of how to create an index in this article: https://msdn.microsoft.com/en-us/library/ms188783.aspx
As you said, there are pros and cons to using indexes.
Pros: SELECT queries will be faster.
Cons: INSERT (and other write) queries will be slower.
Conclusion: add indexes if your table sees few INSERTs and mostly SELECTs.
On which columns should I consider adding an index? This is really a very good question. Though I am not a DB expert, here are my views:
Add an index on your primary key column
Add an index on your join columns (inner/outer/left)
Short answer: on Id and IsEnabled
(despite the controversy about indexing a BIT field; and Id should be the primary key)
Generally, to optimize performance, indexes should go on fields that appear in WHERE or JOIN clauses. Under the hood, the DB server looks for a usable index to satisfy the selection; if none exists it has to scan the table (possibly building temporary hash or sort structures in memory), which takes time, hence the performance degradation.
As Bhuwan noted, indexes are "bad" for INSERTs (keep that in mind for the whole picture when designing a database), but there are only SELECTs in the example provided.
Hope you passed the test :)
-Nick
tldr: I will probably delete this later, so no need!
My answer to this job interview question: "It depends." ... and then I would probably spend too much of the interview talking about how terrible the question is.
The problem is that this is simply a bad question for a "job interview test". I have been poking at this for two hours now, and the longer I spend the more annoyed I get.
With absolutely no information on the content of the table, we can not guarantee that this table is even in first normal form or better, so we can not even assume that the only non-nullable column, Id, is a valid primary key.
With no idea about the content of the table, we do not even know if it needs indexes. If it has only a few rows, then the entire page will sit in memory and whichever operations you are running against it will be fast enough.
With no cardinality information we do not know if a value > 1000 is common or uncommon. All or none of the values could be greater than 1000, but we do not know.
With no cardinality information we do not know if IsEnabled = 1 would mean 99% of rows, or even 0% of rows.
I would say you are on the right track as far as your thought process for evaluating indexing, but the trick is that you are drawing from your experiences with indexes you needed on tables before this table. Applying assumptions based on general previous experience is fine, but you should always test them. In this case, blindly applying general practices could be a mistake.
The question is, where on this table should I put indexes to optimize the performance of the above queries. No other info on the table was provided
If I try to approach this from another position (nothing else matters except the performance of these five queries), I would apply these indexes:
create index ixf_Name on dbo.Teszt(Name)
include (Id)
where id = 10;
create index ixf_Value_Enabled on dbo.Teszt(Value)
include (Id)
where IsEnabled = 1;
create index ixf_Value_gt1k_Enabled on dbo.Teszt(Id)
include (description,value,IsEnabled)
where Value > 1000 and IsEnabled = 1;
create index ixf_Name_Enabled on dbo.Teszt(Id)
include (Name, IsEnabled)
where IsEnabled = 1;
create index ixf_Name_notNull on dbo.Teszt(Name)
include (Description)
where Name is not null;
Also, the decimal(20,4) annoys me because this is the least amount of data you can store in the 13 bytes of space it takes up. decimal(28,4) has the same storage size and if it could have been decimal(19,4) then it would have been only 9 bytes. Granted this is a silly thing to be annoyed about, especially considering the table is going to be wide anyway, but I thought I would point it out anyway.

How to know when to use indexes and which type?

I've searched a bit and didn't see any similar question, so here goes.
How do you know when to put an index in a table? How do you decide which columns to include in the index? When should a clustered index be used?
Can an index ever slow down the performance of select statements? How many indexes is too many and how big of a table do you need for it to benefit from an index?
EDIT:
What about column data types? Is it ok to have an index on a varchar or datetime?
Well, the first question is easy:
When should a clustered index be used?
Always. Period. Except for a very few, rare, edge cases. A clustered index makes a table faster, for every operation. YES! It does. See Kim Tripp's excellent The Clustered Index Debate continues for background info. She also mentions her main criteria for a clustered index:
narrow
static (never changes)
unique
if ever possible: ever increasing
INT IDENTITY fulfills this perfectly - GUIDs do not. See GUIDs as Primary Key for extensive background info.
Why narrow? Because the clustering key is added to each and every index page of each and every non-clustered index on the same table (in order to be able to actually look up the data row, if needed). You don't want to have VARCHAR(200) in your clustering key....
Why unique? See above - the clustering key is the item and mechanism that SQL Server uses to uniquely find a data row. It has to be unique. If you pick a non-unique clustering key, SQL Server itself will add a 4-byte uniqueifier to your keys. Be careful of that!
Next: non-clustered indices. Basically there's one rule: any foreign key in a child table referencing another table should be indexed; it'll speed up JOINs and other operations.
Furthermore, any queries that have WHERE clauses are good candidates - pick those that are executed a lot first. Put indices on columns that show up in WHERE clauses and in ORDER BY clauses.
Next: measure your system, check the DMV's (dynamic management views) for hints about unused or missing indices, and tweak your system over and over again. It's an ongoing process, you'll never be done! See here for info on those two DMV's (missing and unused indices).
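For reference, a hedged sketch of reading the missing-index DMVs mentioned above (the ranking expression is just one common heuristic, not an official score):
SELECT TOP (20)
       d.statement AS table_name,
       d.equality_columns,
       d.inequality_columns,
       d.included_columns,
       s.user_seeks,
       s.avg_user_impact
FROM   sys.dm_db_missing_index_details d
JOIN   sys.dm_db_missing_index_groups g ON g.index_handle = d.index_handle
JOIN   sys.dm_db_missing_index_group_stats s ON s.group_handle = g.index_group_handle
ORDER BY s.user_seeks * s.avg_user_impact DESC;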
Another word of warning: with a truckload of indices, you can make any SELECT query go really really fast. But at the same time, INSERTs, UPDATEs and DELETEs which have to update all the indices involved might suffer. If you only ever SELECT - go nuts! Otherwise, it's a fine and delicate balancing act. You can always tweak a single query beyond belief - but the rest of your system might suffer in doing so. Don't over-index your database! Put a few good indices in place, check and observe how the system behaves, and then maybe add another one or two, and again: observe how the total system performance is affected by that.
The rule of thumb is to index the primary key (implied, and it defaults to clustered) and each foreign key column.
There is more to it, but you could do worse than starting with SQL Server's missing-index DMVs.
An index may slow down a SELECT if the optimiser makes a bad choice, and it is possible to have too many. Too many will slow writes, and it's also possible to end up with redundant, overlapping indexes.
Answering the ones I can, I would say that every table, no matter how small, will always benefit from at least one index, as there has to be at least one way in which you are interested in looking up the data; otherwise why store it?
A general rule for adding indexes is: add one if you need to find data in the table using a particular field or set of fields. This leads on to how many indexes are too many: generally, the more indexes you have, the slower inserts and updates will be, since they also have to modify the indexes, but it all depends on how you use your data. If you need fast inserts, don't use too many. In reporting, "read only" type data stores can carry a number of them to make all the lookups faster.
Unfortunately there is no one rule to guide you on the number or type of indexes to use, although the query optimiser of your chosen DB can give hints based on the queries you are executing.
As to clustered indexes they are the Ace card you only get to use once, so choose carefully. It's worth calculating the selectivity of the field you are thinking of putting it on as it can be wasted to put it on something like a boolean field (contrived example) as the selectivity of the data is very low.
This is really a very involved question, though a good starting place would be to index any column that you will filter results on, e.g. if you often break products into groups by sale price, index the sale_price column of the products table to improve scan times for that query, etc.
If you are querying based on the value in a column, you probably want to index that column.
i.e.
SELECT a,b,c FROM MyTable WHERE x = 1
You would want an index on x.
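A minimal sketch of that (the index name is illustrative):
CREATE INDEX ix_MyTable_x ON MyTable (x);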
Generally, I add indexes for columns which are frequently queried, and I add compound indexes when I'm querying on more than one column.
Indexes won't hurt the performance of a SELECT, but they may slow down INSERTs (and UPDATEs) if you have too many indexes per table.
As a rule of thumb - start off by adding indexes when you find yourself saying WHERE a = 123 (in this case, an index for "a").
You should use an index on columns that you use for selection and ordering - i.e. the WHERE and ORDER BY clauses.
Indexes can slow down select statements if there are many of them and you are using WHERE and ORDER BY on columns that have not been indexed.
As for the size of the table - several thousand rows and upwards is where you start to see real benefits from index usage.
Having said that, there are automated tools to help with this, and SQL Server has a Database Tuning Advisor that will help.

Does an index on a unique field in a table allow a select count(*) to happen instantly? If not why not?

I know just enough about SQL tuning to get myself in trouble. Today I was doing EXPLAIN plan on a query and I noticed it was not using indexes when I thought it probably should. Well, I kept doing EXPLAIN on simpler and simpler (and more indexable in my mind) queries, until I did EXPLAIN on
select count(*) from table_name
I thought for sure this would return instantly and that the explain would show use of an index, as we have many indexes on this table, including an index on the row_id column, which is unique. Yet the explain plan showed a FULL table scan, and it took several seconds to complete. (We have 3 million rows in this table).
Why would Oracle be doing a full table scan to count the rows in this table? I would like to think that since Oracle is indexing unique fields already, and having to track every insert and update on that table, it would be caching the row count somewhere. Even if it's not, wouldn't it be faster to scan the entire index than to scan the entire table?
I have two theories. Theory one is that I am imagining how indexes work incorrectly. Theory two is that some setting or parameter somewhere in our Oracle setup is messing with Oracle's ability to optimize queries (we are on Oracle 9i). Can anyone enlighten me?
Oracle does not cache COUNT(*).
MySQL with MyISAM does (it can afford to), because MyISAM is transactionless and the same COUNT(*) is visible to everyone.
Oracle is transactional, and a row deleted in another transaction may still be visible to your transaction.
Oracle has to scan the table, see that a row is deleted, visit the UNDO, make sure the row is still in place from your transaction's point of view, and add it to the count.
Indexing a UNIQUE value differs from indexing a non-UNIQUE one only logically.
In fact, you can create a UNIQUE constraint over a column with a non-unique index defined, and the index will be used to enforce the constraint.
If a column is marked as NOT NULL, then an INDEX FAST FULL SCAN over this column can be used for the COUNT.
It's a special access method, used for cases when the index order is not important. It does not traverse the B-Tree, but instead just reads the pages sequentially.
Since an index has fewer pages than the table itself, the COUNT can be faster with an INDEX_FFS than with a FULL table scan.
It is certainly possible for Oracle to satisfy such a query with an index (specifically with an INDEX FAST FULL SCAN).
In order for the optimizer to choose that path, at least two things have to be true:
Oracle has to be certain that every row in the table is represented in the index -- basically, that there are no rows missing from the index because all their indexed columns are NULL. If you have a primary key, this should be guaranteed.
Oracle has to calculate the cost of the index scan as lower than the cost of a table scan. I don't think it is necessarily true that an index scan is always cheaper.
Possibly, gathering statistics on the table would change the behavior.
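A hedged sketch of both suggestions (table and index names are placeholders; check with EXPLAIN PLAN whether the plan actually changes):
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(ownname => USER, tabname => 'TABLE_NAME');
END;
/
-- Nudge the optimizer toward a fast full scan of the primary key index:
SELECT /*+ INDEX_FFS(t table_name_pk) */ COUNT(*)
FROM   table_name t;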
Expanding a little on the "transactions" reason. When a database supports transactions, at any point in time there might be records in different states, even in a "deleted" state. If a transaction fails, the states are rolled back.
A full table scan is done so that the current "version" of each record can be accessed for that point in time.
MySQL MyISAM doesn't have this problem, since it uses table locking instead of the record-level locking required for transactions, and it caches the record count, so the count is always returned instantly. InnoDB under MySQL works the same way as Oracle, but returns an "estimate".
You may be able to get a quicker query by counting the distinct values of the primary key; then only the index would be accessed.

Indexing nulls for fast searching on DB2

It's my understanding that nulls are not indexable in DB2, so assuming we have a huge table (Sales) with a date column (sold_on) which is normally a date, but is occasionally (10% of the time) null.
Furthermore, let's assume that it's a legacy application that we can't change, so those nulls are staying there and mean something (let's say sales that were returned).
We can make the following query fast by putting an index on the sold_on and total columns
Select * from Sales
where
Sales.sold_on between date1 and date2
and Sales.total = 9.99
But an index won't make this query any faster:
Select * from Sales
where
Sales.sold_on is null
and Sales.total = 9.99
Because the indexing is done on the value.
Can I index nulls? Maybe by changing the index type? Or by indexing an indicator column?
From where did you get the impression that DB2 doesn't index NULLs? I can't find anything in the documentation or articles supporting that claim. And I just ran a query on a large table using an IS NULL restriction involving an indexed column containing a small fraction of NULLs; in this case, DB2 certainly used the index (verified by an EXPLAIN, and by observing that the database responded instantly instead of spending time on a table scan).
So: I claim that DB2 has no problem with NULLs in non-primary key indexes.
But as others have written: Your data may be composed in a way where DB2 thinks that using an index will not be quicker. Or the database's statistics aren't up-to-date for the involved table(s).
I'm no DB2 expert, but if 10% of your values are NULL, I don't think an index on that column alone will ever help this query. 10% is too large a fraction for the index to be worth using -- it'll just do a table scan. If you were talking about 2-3%, I think it would actually use the index.
Think about how many records are on a page/block -- say 20. The reason to use an index is to avoid fetching pages you don't need. The odds that a given page contains zero NULL records are (90%)^20, or about 12%. Those aren't good odds -- you're going to need to fetch 88% of your pages anyway, so using the index isn't very helpful.
If, however, your select clause only included a few columns (and not *) -- say just salesid -- you could probably get it to use an index on (sold_on, salesid), as the read of the data page wouldn't be needed: all the data would be in the index.
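A hedged sketch of that covering index (salesid stands in for whatever column you actually need, and the index name is illustrative):
CREATE INDEX ix_sales_soldon_salesid ON Sales (sold_on, salesid);
SELECT salesid
FROM   Sales
WHERE  sold_on IS NULL;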
The rule of thumb is that an index is useful for values up to about 15% of the records... so an index might be useful here.
If DB2 won't index nulls, then I would suggest adding a boolean field, IsSold, and setting it to true whenever the sold_on date gets set (this could be done in a trigger).
That's not the nicest solution, but it might be what you need.
Troels is correct; even rows with a SOLD_ON value of NULL will benefit from an index on that column. If you're doing ranged searches on SOLD_ON, you may benefit even more by creating a clustered index that begins with SOLD_ON. In this particular example, it may not require much additional overhead to maintain the clustering order based on SOLD_ON, since newer rows added will most likely have a newer SOLD_ON date.

Does limiting a query to one record improve performance

Will limiting a query to one result record improve performance in a large(ish) MySQL table if the table only has one matching result?
for example
select * from people where name = "Re0sless" limit 1
if there is only one record with that name? And what about if name were the primary key / set to unique? And is it worth updating the query, or will the gain be minimal?
If the column has
a unique index: no, it's no faster
a non-unique index: maybe, because it will prevent sending any additional rows beyond the first matched, if any exist
no index: sometimes
if 1 or more rows match the query, yes, because the full table scan will be halted after the first row is matched.
if no rows match the query, no, because it will need to complete a full table scan
If you have a slightly more complicated query, with one or more joins, the LIMIT clause gives the optimizer extra information. If it expects to match two tables and return all rows, a hash join is typically optimal. A hash join is a type of join optimized for large numbers of matching rows.
Now if the optimizer knows you've passed LIMIT 1, it knows that it won't be processing large amounts of data, so it can revert to a nested-loop join.
Based on the database (and even database version) this can have a huge impact on performance.
To answer your questions in order:
1) Yes, if there is no index on name. The query will end as soon as it finds the first record. Take off the limit and it has to do a full table scan every time.
2) No. Primary/unique keys are guaranteed to be unique, so the query will stop running as soon as it finds the row anyway.
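If you do make name unique, a hedged sketch (the index name is illustrative, and this assumes name fits within MySQL's index key length limit):
ALTER TABLE people ADD UNIQUE KEY ux_people_name (name);
EXPLAIN SELECT * FROM people WHERE name = 'Re0sless' LIMIT 1;
-- The plan should now show a const-type lookup on the unique key rather than a scan.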
I believe the LIMIT is applied after the data set is found and the result set is being built up, so I wouldn't expect it to make any difference at all. Making name the primary key will have a significant positive effect, though, as it will result in an index being created for the column.
If "name" is unique in the table, then there may still be a (very very minimal) gain in performance by putting the limit constraint on your query. If name is the primary key, there will likely be none.
Yes, you will notice a performance difference when dealing with the data. One record takes up less space than multiple records. Unless you are dealing with many rows this would not be much of a difference, but once you run the query, the data has to be displayed back to you or dealt with programmatically, which is costly. Either way, one record is easier to handle than multiple.