SQL Server table optimal indexing

I have a very specific question, this is part of a job interview test.
I have this table:
CREATE TABLE Teszt
(
Id INT NOT NULL
, Name NVARCHAR(100)
, [Description] NVARCHAR(MAX)
, Value DECIMAL(20,4)
, IsEnabled BIT
)
And these selects:
SELECT Name
FROM Teszt
WHERE Id = 10
SELECT Id, Value
FROM Teszt
WHERE IsEnabled = 1
SELECT [Description]
FROM Teszt
WHERE Name LIKE '%alma%'
SELECT [Description]
FROM Teszt
WHERE Value > 1000 AND IsEnabled = 1
SELECT Id, Name
FROM Teszt
WHERE IsEnabled = 1
The question is, where on this table should I put indexes to optimize the performance of the above queries. No other info on the table was provided, so my answer will contain the general pro/contra arguments for indexes, but I'm not sure regarding the above queries.
My thoughts on optimizing these specific queries with indexes:
Id should probably have an index; it looks like the primary key and it is part of a WHERE clause.
Creating one on the Value column would also be good, as it's part of a WHERE clause here.
Now it gets murky for me. For the Name column, based on just the above queries, I probably shouldn't create one, as it is only used with a leading-wildcard LIKE, which defeats the purpose of an index, right?
I tried to read everything on indexing a bit column (the IsEnabled column in the table), but I can't say it's any clearer to me, as the arguments range wildly. Should I create an index on it? Should it be filtered? Should it be a separate index, or part of one with the other columns?
Again, this is all theoretical, so no info on the size or the usage of the table.
Thanks in advance for any answer!
Regards,
Tom

An index on a bit column is generally not recommended. The following discussion applies not only to bit columns but to any low-cardinality value. In English, "low-cardinality" means the column takes on only a handful of values.
The reason is simple. A bit column takes on three values (if you include NULL). That means that a typical selection on the column would return about a third of the rows. A third of the rows means that you would (typically) be accessing every data page. If so, you might as well do a full table scan.
So, let's ask the question explicitly: When is an index on a bit column useful or appropriate?
First, the above argument does not work if you are always looking for IsEnabled = 1 and, say, 0.001% of the rows are enabled. This is a highly selective query and an index could help. Note: The index would not help on IsEnabled = 0 in this scenario.
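For that highly selective case, a filtered index is the natural fit. A minimal sketch against the Teszt table (the key and INCLUDE columns here are just one reasonable choice, picked to cover the second query):
CREATE NONCLUSTERED INDEX IX_Teszt_Enabled
ON Teszt (Id)
INCLUDE (Value)
WHERE IsEnabled = 1;
SQL Server will only consider it when the query's predicate implies IsEnabled = 1, and the index stays tiny because only the enabled rows are stored.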
Second, the above argument argues in favor of a clustered index on the bit value. If the values are clustered, then even a 30% selectivity means that you are only reading 30% of the rows. The downside is that updating the value means moving the record from one data page to another (a somewhat expensive operation).
Third, a bit column can constructively be part of a larger index. This is especially true of a clustered index with the bit being first. For instance, for the fourth query, one could argue that a clustered index on (IsEnabled, Value) would be the optimal index. (Description can't be an index key column because it is NVARCHAR(MAX), but a clustered index covers it anyway, since the leaf level holds the whole row.)
To be honest, though, I don't like playing around with clustered indexes. I prefer that the primary key be the clustered index. I admit that performance gains can be impressive for a narrow set of queries -- and if this is your use case, then use them (and accessing enabled rows might be a good reason for using them). However, the clustered index is something that you get to use only once, and primary keys are the best generic option to optimize joins.
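For concreteness, the two mutually exclusive options would look like this on the example table (index and constraint names are mine; a table gets only one clustered index):
-- the conventional choice: cluster on the primary key
ALTER TABLE Teszt ADD CONSTRAINT PK_Teszt PRIMARY KEY CLUSTERED (Id);
-- the alternative discussed above: cluster on the bit first
-- (commented out because only one clustered index can exist;
--  it would cover Description anyway, as leaf pages hold the full row)
-- CREATE CLUSTERED INDEX CIX_Teszt_Enabled ON Teszt (IsEnabled, Value);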

You can read the details on how to create an index in this article: https://msdn.microsoft.com/en-us/library/ms188783.aspx
As you said, there are pros and cons to using indexes.
Pros: SELECT queries will be faster.
Cons: INSERT queries will be slower, because every index has to be maintained.
Conclusion: Add indexes if your table gets few INSERTs and mostly SELECT operations.
In which columns should I consider adding an index? This is really a very good question. Though I am not a DB expert, here are my views:
Add an index on your primary key column
Add an index on your join columns [inner/outer/left] - see the sketch below
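A hypothetical illustration of the join-column point (the tables and names here are invented for the example):
CREATE INDEX IX_OrderItems_OrderId ON dbo.OrderItems (OrderId);
-- the index above supports the join below
SELECT o.Id, i.Quantity
FROM dbo.Orders AS o
INNER JOIN dbo.OrderItems AS i ON i.OrderId = o.Id;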

Short answer: on Id and IsEnabled
(despite the controversy about indexing a BIT field; and Id should be the primary key)
Generally, to optimize performance, indexes should go on the fields that appear in WHERE or JOIN clauses. Under the hood, the db server looks for a usable index to make the selection; if none is found it has to scan the table (or, in some plans, build a temporary index spool in memory), which takes time, hence the performance degradation.
As Bhuwan noted, indexes are "bad" for INSERTs (keep that in mind for the whole picture when designing a database), but there are only SELECTs in the example provided.
Hope you passed the test :)
-Nick

tldr: I will probably delete this later, so no need!
My answer to this job interview question: "It depends." ... and then I would probably spend too much of the interview talking about how terrible the question is.
The problem is that this is simply a bad question for a "job interview test". I have been poking at this for two hours now, and the longer I spend the more annoyed I get.
With absolutely no information on the content of the table, we cannot guarantee that this table is even in first normal form or better, so we cannot even assume that the only non-nullable column, Id, is a valid primary key.
With no idea about the content of the table, we do not even know if it needs indexes. If it has only a few rows, then the entire page will sit in memory and whichever operations you are running against it will be fast enough.
With no cardinality information we do not know if a value > 1000 is common or uncommon. All or none of the values could be greater than 1000, but we do not know.
With no cardinality information we do not know if IsEnabled = 1 would mean 99% of rows, or even 0% of rows.
I would say you are on the right track as far as your thought process for evaluating indexing, but the trick is that you are drawing from your experiences with indexes you needed on tables before this table. Applying assumptions based on general previous experience is fine, but you should always test them. In this case, blindly applying general practices could be a mistake.
The question is, where on this table should I put indexes to optimize the performance of the above queries. No other info on the table was provided
If I try to approach this from another position: Nothing else matters except the performance of these five queries, I would apply these indexes:
-- query 1: SELECT Name ... WHERE Id = 10
create index ixf_Name on dbo.Teszt(Name)
include (Id)
where Id = 10;
-- query 2: SELECT Id, Value ... WHERE IsEnabled = 1
create index ixf_Value_Enabled on dbo.Teszt(Value)
include (Id)
where IsEnabled = 1;
-- query 4: SELECT [Description] ... WHERE Value > 1000 AND IsEnabled = 1
create index ixf_Value_gt1k_Enabled on dbo.Teszt(Id)
include (Description, Value, IsEnabled)
where Value > 1000 and IsEnabled = 1;
-- query 5: SELECT Id, Name ... WHERE IsEnabled = 1
create index ixf_Name_Enabled on dbo.Teszt(Id)
include (Name, IsEnabled)
where IsEnabled = 1;
-- query 3: SELECT [Description] ... WHERE Name LIKE '%alma%' (still a scan, but of a much narrower index)
create index ixf_Name_notNull on dbo.Teszt(Name)
include (Description)
where Name is not null;
Also, the decimal(20,4) annoys me, because it is the least amount of data you can store in the 13 bytes of space it takes up. decimal(28,4) has the same storage size, and if it could have been decimal(19,4) it would have taken only 9 bytes. Granted, this is a silly thing to be annoyed about, especially considering the table is going to be wide anyway, but I thought I would point it out.
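If you want to verify those storage sizes yourself, DATALENGTH reports them directly:
SELECT
DATALENGTH(CAST(0 AS DECIMAL(19,4))) AS bytes_19_4, -- 9
DATALENGTH(CAST(0 AS DECIMAL(20,4))) AS bytes_20_4, -- 13
DATALENGTH(CAST(0 AS DECIMAL(28,4))) AS bytes_28_4; -- 13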

Related

Index on multiple bit fields in SQL Server

We currently have a scenario where one table effectively has several (10 to 15) boolean flags (not nullable bit fields). Unfortunately, it is not really possible to simplify this too much on a logical level, because any combination of the boolean values is permissible.
The table in question is a transactional table which may end up having tens of millions of rows, and both insert and select performance is fairly critical. Although we are not quite sure of the distribution of the data at this time, the combination of all flags should provide relatively good cardinality, i.e. make it a "worthwhile" index for SQL Server to make use of.
Typical select query scenarios might be to select records based on 3 or 4 of the flags only, e.g. WHERE FLAG3=1 AND FLAG7=0 AND FLAG9=1. It would not be practical to create separate indexes for all combinations of the flags used by these select queries, as there will be many of them.
Given this situation, what would be the recommended approach to effectively index these fields? The table is new, so there is no existing data to worry about yet, and we have a fair amount of flexibility in the actual implementation of the table.
There are two main options that we are considering at the moment:
Create a single index which includes all the bit fields (this would probably include 1 or 2 other int fields which would always be used). My concern is that given the typical usage of only including a few of the fields, this approach would skip the index and resort to a table scan. Let's call this Option A. (Having read some of the replies, it seems that this approach would not work well, since the order of the fields in the index would make a difference, making it impossible to index effectively on ALL the fields.)
Effectively do what I believe SQL Server is doing internally, and encode the bit fields into a single int field using binary operators (AND-ing and OR-ing numbers together: 1, 2, 4, 8, etc). My concern here is that we'd need to do some kind of calculation to query on this encoded field, which would skip the index again. Maintenance and the complexity of this solution is also a concern. Let's call this Option B. Additional info: The argument for this approach is that we could have a relatively simple and short index which includes one or two other fields from the table and this field. The other fields would narrow down the number of records needing to be evaluated, and since the encoded field would contain all of our bit fields, SQL Server would be able to perform the calculation using the data retrieved from the index directly (i.e. an index scan) as opposed to the table (i.e. a table scan).
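To make Option B concrete, here is a sketch of the encoding and a typical query (the table name and flag-to-bit assignments are assumptions for illustration):
-- pack the flags into one INT: FLAG1 = 1, FLAG2 = 2, FLAG3 = 4, FLAG4 = 8, ...
ALTER TABLE Transactions ADD Flags INT NOT NULL DEFAULT 0;
-- FLAG3=1 AND FLAG7=0 AND FLAG9=1: mask the three bits of interest,
-- then compare against the required pattern
SELECT *
FROM Transactions
WHERE Flags & (4 + 64 + 256) = (4 + 256); -- FLAG3=4, FLAG7=64, FLAG9=256
Note that this predicate is not sargable on Flags alone, which is exactly the concern raised above; it only pays off when other leading index columns narrow down the candidate rows first.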
At the moment, we are heavily leaning towards Option B. For completeness, this would be running on SQL Server 2008.
Any advice would be greatly appreciated.
Edit: Spelling, clarity, query example, additional info on Option B.
A single BIT column typically is not selective enough to be even considered for use in an index. So an index on a single BIT column really doesn't make sense - on average, you'd always have to search about half the entries in the table (50% selectiveness) and so the SQL Server query optimizer will instead use a table scan.
If you create a single index on all 15 bit columns, then you don't have that problem - since you have 15 yes/no options, your index will become quite selective.
Trouble is: the sequence of the bit columns is important. Your index will only ever be considered if your SQL statement filters on the left-most 1 to n of the indexed BIT columns.
So if your index is on
Col1,Col2,Col3,....,Col14,Col15
then it might be used for a query that uses
Col1
Col1 and Col2
Col1 and Col2 and Col3
....
and so on. But it cannot be used for a query that specifies Col6,Col9 and Col14.
Because of that, I don't really think an index on your collection of BIT columns really makes a lot of sense.
Are those 15 BIT columns the only columns you use for querying? If not, I would try to combine those BIT columns that you use most for selection with other columns, e.g. have an index on Name and Col7 or something (then your BIT columns can add some additional selectivity to another index)
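A sketch of that last suggestion (the table name T and the index name are placeholders):
CREATE INDEX IX_T_Name_Col7 ON dbo.T (Name, Col7);
-- usable as a seek for: WHERE Name = @n
-- and for:              WHERE Name = @n AND Col7 = 1
-- but not for:          WHERE Col7 = 1 alone (Col7 is not the leftmost column)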
Whilst there are probably ways to solve your indexing problem against your existing table schema, I would reduce this to a normalisation problem:
e.g I would highly recommend creating a series of new tables:
Lookup table for the names of these bit flags, e.g. CREATE TABLE Flags (id int IDENTITY(1,1), Name varchar(256)) (you don't have to make id an identity-seed column if you want to manually control the ids - e.g. 2,4,8,16,32,64,128 as binary flags.)
Create a new link table that connects the ids of the original data table to the flag ids, e.g. CREATE TABLE DataFlags_Link (id int IDENTITY(1,1), MyFlagId int, DataId int)
You could then create an index on the DataFlags_Link table and write queries like:
SELECT Data.*
FROM Data
INNER JOIN DataFlags_Link ON Data.id = DataFlags_Link.DataId
WHERE DataFlags_Link.MyFlagId IN (4,7,2,8)
As for performance, that's where good DBA maintenance comes in. You'll want to set the INDEX fill-factor and padding on your tables appropriately and run regular index defragmentation or rebuild your indexes on a schedule.
Performance and maintenance go hand-in-hand with databases. You can't have one without the other.
Whilst I think Neil Fenwick's answer is probably right, I think the real answer is to try out the different options and see which one is fast enough.
Option A is probably the most straightforward solution, and therefore likely the most maintainable - and it may well be fast enough.
I would build a prototype database, with the "Option A" schema, and use something like http://www.red-gate.com/products/sql-development/sql-data-generator/ or http://sourceforge.net/projects/dbmonster/ to create twice as much data as you expect to need, and then build the queries you expect to need. Agree an acceptable response time, and only consider a "faster" schema if you exceed those response times (and you can't throw hardware at the problem).
Neil's solution is probably just as obvious and maintainable as "Option A" - and it should be easy to index. However, I'd still test it by creating a prototype schema and generating a lot of test data...

Why can't I simply add an index that includes all columns?

I have a table in SQL Server database which I want to be able to search and retrieve data from as fast as possible. I don't care about how long time it takes to insert into the table, I am only interested in the speed at which I can get data.
The problem is the table is accessed with 20 or more different types of queries. This makes it a tedious task to add an index specially designed for each query. I'm considering instead simply adding an index that includes ALL columns of the table. It's not something you would normally do in "good" database design, so I'm assuming there is some good reason why I shouldn't do it.
Can anyone tell me why I shouldn't do this?
UPDATE: I forgot to mention, I also don't care about the size of my database. It's OK if my database size grows larger than it needs to.
First of all, an index in SQL Server can have at most 900 bytes in its index entry. That alone makes it impossible to have an index with all columns.
Most of all: such an index makes no sense at all. What are you trying to achieve??
Consider this: if you have an index on (LastName, FirstName, Street, City), that index will not be able to be used to speed up queries on
FirstName alone
City
Street
That index would be useful for searches on
(LastName), or
(LastName, FirstName), or
(LastName, FirstName, Street), or
(LastName, FirstName, Street, City)
but really nothing else - certainly not if you search for just Street or just City!
The order of the columns in your index makes quite a difference, and the query optimizer can't just use any column somewhere in the middle of an index for lookups.
Consider your phone book: it's ordered by LastName, FirstName, maybe Street. So does that index help you find all "Joes" in your city? All people living on "Main Street"?? No - you can only look up by LastName first - then you get more specific inside that set of data. Just having an index over everything doesn't speed up searching for all columns at all.
If you want to be able to search by Street - you need to add a separate index on (Street) (and possibly another column or two that make sense).
If you want to be able to search by Occupation or whatever else - you need another specific index for that.
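In code, the phone-book discussion boils down to one index per search pattern (names are illustrative):
CREATE INDEX IX_Person_Name ON dbo.Person (LastName, FirstName); -- find by last name (+ first name)
CREATE INDEX IX_Person_Street ON dbo.Person (Street);            -- find by street
CREATE INDEX IX_Person_Occupation ON dbo.Person (Occupation);    -- find by occupation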
Just because your column exists in an index doesn't mean that'll speed up all searches for that column!
The main rule is: use as few indices as possible - too many indices can be even worse for a system than having no indices at all.... build your system, monitor its performance, and find those queries that cost the most - then optimize these, e.g. by adding indices.
Don't just blindly index every column just because you can - this is a guarantee for lousy system performance - any index also requires maintenance and upkeep, so the more indices you have, the more your INSERT, UPDATE and DELETE operations will suffer (get slower) since all those indices need to be updated.
You have a fundamental misunderstanding of how indexes work.
Read this explanation "how multi-column indexes work".
The next question you might have is why not create one index per column--but that's also a dead end if you are trying to reach top SELECT performance.
You might feel that it is a tedious task, but I would say it's a required task to index carefully. Sloppy indexing strikes back, as in this example.
Note: I am strongly convinced that proper indexing pays off, and I know that many people have the very same questions you have. That's why I'm writing a free book about it. The links above refer to the pages that might help you answer your question. However, you might also want to read it from the beginning.
...if you add an index that contains all columns, and a query was actually able to use that index, it would scan it in the order of the primary key, which means hitting nearly every record. The average search time would be n/2 rows - the same as scanning the base table.
You need to read a lot about indexes.
It might help if you consider an index on a table to be a bit like a Dictionary in C#.
var nameIndex = new Dictionary<String, List<int>>();
That means that the name column is indexed, and will return a list of primary keys.
var nameOccupationIndex = new Dictionary<String, Dictionary<String, List<int>>>();
That means that the name column + occupation columns are indexed. Now imagine the index contained 10 different columns, nested so far deep it contains every single row in your table.
This isn't exactly how it works mind you. But it should give you an idea of how indexes could work if implemented in C#. What you need to do is create indexes based on one or two keys that are queried on extensively, so that the index is more useful than scanning the entire table.
If this is a data warehouse type operation where queries are highly optimized for READ queries, and if you have 20 ways of dissecting the data, e.g.
WHERE clause involves..
Q1: status, type, customer
Q2: price, customer, band
Q3: sale_month, band, type, status
Q4: customer
etc
And you absolutely have plenty of fast storage space to burn, then by all means create an index for EVERY single column, separately. So a 20-column table will have 20 indexes, one for each individual column. I could probably say to ignore bit columns or low cardinality columns, but since we're going so far, why bother (with that admonition). They will just sit there and churn the WRITE time, but if you don't care about that part of the picture, then we're all good.
Analyze your 20 queries, and if you have hot queries (the hottest ones) that still won't go any faster, plan them using SSMS (press Ctrl-L) with one query in the query window. It will tell you what index can help that query - just create it; create them all, remembering that this again adds to the write cost, backup file size, db maintenance time, etc.
I think the questioner is asking
'why can't I make an index like':
create index index_name
on table_name
(
*
)
The problems with that have been addressed.
But given that it sounds like they are using MS SQL Server, it's useful to understand that you can include nonkey columns in an index, so that the values of those columns are available for retrieval from the index, but cannot be used as selection criteria:
create index index_name
on table_name
(
foreign_key
)
include (a,b,c,d) -- every column except foreign key
I created two tables with a million identical rows
I indexed table A like this
create nonclustered index index_name_A
on A
(
foreign_key -- this is a guid
)
and table B like this
create nonclustered index index_name_B
on B
(
foreign_key -- this is a guid
)
include (id,a,b,c,d) -- every column except foreign_key
No surprise, table A was slightly faster to insert into.
But when I ran these queries:
select * from A where foreign_key = #guid
select * from B where foreign_key = #guid
On table A, SQL Server didn't even use the index; it did a table scan, and complained about a missing index including id,a,b,c,d.
On table B, the query was over 50 times faster with much less IO.
Forcing the query on A to use the index didn't make it any faster:
select * from A where foreign_key = #guid
select * from A with (index(index_name_A)) where foreign_key = #guid
I'm considering instead simply adding an index that includes ALL columns of the table.
This is always a bad idea. Indexes in a database are not some sort of pixie dust that works magically. You have to analyze your queries and, according to what is being queried and how, append indexes.
It is not as simple as "add everything to index and have a nap"
I see only long and complicated answers here so I thought I should give the simplest answer possible.
You cannot add an entire table, or all its columns, to an index because that just duplicates the table.
In simple terms, an index is just another table with selected data ordered in the order you normally expect to query it in, and a pointer to the row on disk where the rest of the data lives.
So, a level of indirection exists. You have a partial copy of the table in a preordered manner (both on disk and in RAM, assuming the index is not fragmented), which is faster to query for the columns defined in the index only, while the rest of the columns can be fetched without having to scan the disk for them, because the index contains a reference to the correct position on disk where the rest of the data for each row lives.
1) Size: an index essentially builds a copy of the data in that column in some easily searchable structure, like a B-tree (I don't know the SQL Server specifics).
2) You mentioned speed: index structures are slower to add to.
That index would just be identical to your table (possibly sorted in another order).
It won't speed up your queries.

Table index design

I would like to add index(s) to my table.
I am looking for general ideas how to add more indexes to a table.
Other than the PK clustered.
I would like to know what to look for when I am doing this.
So, my example:
This table (let's call it the TASK table) is going to be the biggest table of the whole application. Expecting millions of records.
IMPORTANT: massive bulk-insert is adding data in this table
table has 27 columns: (so far, and counting :D )
int x 9 columns = id-s
varchar x 10 columns
bit x 2 columns
datetime x 5 columns
INT COLUMNS
all of these are INT IDs, but from tables that are usually smaller than the Task table (10-50 records max), for example: a Status table (with values like "open", "closed") or a Priority table (with values like "important", "not so important", "normal")
there is also a column like "parent-ID" (a self-referencing ID)
join: all the "small" tables have PK, the usual way ... clustered
STRING COLUMNS
there is a (Company) column (string!) that is something like "5 characters long all the time" and every user will be restricted using this one. If in Task there are 15 different "Companies" the logged in user would only see one. So there's always a filter on this one. Might be a good idea to add an index to this column?
DATE COLUMNS
I think they don't index these ... right? Or can/should they be indexed?
I wouldn't add any indices - unless you have specific reasons to do so, e.g. performance issues.
In order to figure out what kind of indices to add, you need to know:
what kind of queries are being used against your table - what are the WHERE clauses, what kind of ORDER BY are you doing?
how is your data distributed? Which columns are selective enough (< 2% of the data) to be useful for indexing
what kind of (negative) impact do additional indices have on your INSERTs and UPDATEs on the table
any foreign key columns should be part of an index - preferably as the first column of the index - to speed up JOINs to other tables
And sure you can index a DATETIME column - what made you think you cannot?? If you have a lot of queries that will restrict their result set by means of a date range, it can make total sense to index a DATETIME column - maybe not by itself, but in a compound index together with other elements of your table.
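For example, a compound index along these lines (column names assumed for illustration) serves queries that filter on another column and then a date range:
CREATE INDEX IX_Task_Status_Created ON dbo.Task (StatusID, CreatedDate);
-- helps: WHERE StatusID = 1 AND CreatedDate >= '20240101' AND CreatedDate < '20240201'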
What you cannot index are columns that hold more than 900 bytes of data - anything like VARCHAR(1000) or such.
For great in-depth and very knowledgeable background on indexing, consult the blog by Kimberly Tripp, Queen of Indexing.
In general, an index will speed up a JOIN, a sort operation, and a filter.
So if the columns are in the JOIN, the ORDER BY, or the WHERE clause, then an index will help in terms of performance... but there is always a but... with every index that you add, UPDATE, DELETE, and INSERT operations will be slowed down because the indexes have to be maintained.
so the answer is...it depends
I would say start hitting the table with queries and look at the execution plans for scans, try to make those seeks by either writing SARGable queries or adding indexes if needed...don't just add indexes for the sake of adding indexes
Step one is to understand how the data in the table will be used: how will it be inserted, selected, updated, deleted. Without knowing your usage patterns, you're shooting in the dark. (Note also that whatever you come up with now, you may be wrong. Be sure to compare your decisions with actual usage patterns once you're up and running.) Some ideas:
If users will often be looking up individual items in the table, an index on the primary key is critical.
If data will be inserted with great frequency and you have multiple indexes, over time you will have to deal with index fragmentation. Read up on and understand clustered and non-clustered indexes and fragmentation (ALTER INDEX...REBUILD).
But, if performance is key in situations when you need to retrieve a lot of rows, you might consider using your clustered index to support that.
If you often want a set of data based on Status, indexing on that column can be good--particularly if 1% of your rows are "Active" vs. 99% "Not Active", and all you want are the active ones.
Conversely, if your "PriorityId" is only used to get the "label" stating what PriorityId 42 is (i.e. join into the lookup table), you probably don't need an index on it in your main table.
A last idea, if everyone will always retrieve data for only one Company at a time, then (a) you'll definitely want to index on that, and (b) you might want to consider partitioning the table on that value, as it can act as a "built in filter" above and beyond conventional indexing. (This is perhaps a bit extreme and it's only available in Enterprise edition, but it may be worth it in your case.)
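For point (a), a sketch of what that index might look like (the INCLUDE list is a placeholder for whatever the common queries select):
CREATE INDEX IX_Task_Company ON dbo.Task (Company) INCLUDE (StatusID, PriorityID);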

Database Design: replace a boolean column with a timestamp column?

Earlier I have created tables this way:
create table workflow (
id number primary key,
name varchar2(100 char) not null,
is_finished number(1) default 0 not null,
date_finished date
);
Column is_finished indicates whether the workflow finished or not. Column date_finished is when the workflow was finished.
Then I had the idea "I don't need is_finished as I can just say: where date_finished is not null", and I designed without the is_finished column:
create table workflow (
id number primary key,
name varchar2(100 char) not null,
date_finished date
);
(We use Oracle 10)
Is it a good or bad idea? I've heard you cannot have an index on a column with NULL values, so where date_finished is not null will be very slow on big tables.
Is it a good or bad idea?
Good idea.
You've eliminated space taken by a redundant column; the DATE column serves double duty--you know the work was finished, and when.
I've heard you can't have an index on a column with NULL values, so "where date_finished is not null" will be very slow on big tables.
That's incorrect. Oracle indexes ignore NULL values.
You can create a function based index in order to get around the NULL values not being indexed, but most DBAs I've encountered really don't like them so be prepared for a fight.
There is a right way to index NULL values, and it doesn't use an FBI. Oracle will index NULL values in a composite key, but it will NOT create an index entry when all of the key columns are NULL. So, you could eliminate the column is_finished and create the index with a constant as the second column, like this:
CREATE INDEX idx_date_finished ON workflow (date_finished, 1);
Then, if you check the explain plan on this query:
SELECT count(*) FROM workflow WHERE date_finished is null;
You might see the index being used (if the optimizer is happy).
Back to the original question: looking at the variety of answers here, I think there is no right answer. I may have a personal preference to eliminate a column if it is unnecessary, but I also don't like overloading the meaning of columns either. There are two concepts here:
The record has finished. is_finished
The record finished on a particular date. date_finished
Maybe you need to keep these separate, maybe you don't. When I think about eliminating the is_finished column, it bothers me. Down the road, the situation may arise where the record finished, but you don't know precisely when. Perhaps you have to import data from another source and the date is unknown. Sure, that's not in the business requirements now, but things change. What do you do then? Well, you have to put some dummy value in the date_finished column, and now you've compromised the data a bit. Not horribly, but there is a rub there. The little voice in my head is shouting YOU'RE DOING IT WRONG when I do things like that.
My advice, keep it separate. You're talking about a tiny column and a very skinny index. Storage should not be an issue here.
Rule of Representation: Fold knowledge
into data so program logic can be
stupid and robust.
-Eric S. Raymond
To all those who said the column is a waste of space:
Double Duty isn't a good thing in a database. Your primary goal should be clarity. Lots of systems, tools, people will use your data. If you disguise values by burying meaning inside of other columns you're BEGGING for another system or user to get it wrong.
And anyone who thinks it saves space is utterly wrong.
You'll need two indexes on that date column... one will be function-based, as OMG suggests. It will look like this:
NVL(Date_finished, TO_DATE('01-JAN-9999', 'DD-MON-YYYY'))
So to find unfinished jobs you'll have to make sure to write the where clause correctly
It will look like this:
WHERE
NVL(Date_finished, TO_DATE('01-JAN-9999', 'DD-MON-YYYY')) = TO_DATE('01-JAN-9999', 'DD-MON-YYYY')
Yep. That's so clear. It's completely better than
WHERE
IS_Unfinished = 'YES'
The reason you'll want to have a second index on the same column is for EVERY OTHER query on that date... you won't want to use that index for finding jobs by date.
So let's see what you've accomplished with OMG's suggestion et al.
You've used more space, you've obfuscated the meaning of the data, you've made errors more likely... WINNER!
Sometimes it seems programmers are still living in the 70's, when a MB of hard drive space was a down payment on a house.
You can be space efficient about this without giving up a lot of clarity. Make Is_unfinished either 'Y' or NULL... IF you will only use that column to find 'work to do'. This will keep that index compact. It will only be as big as the rows which are unfinished (in this way you exploit the unindexed NULLs instead of being screwed by them). You add a little bit of space to your table, but overall it's less than the FBI. You need 1 byte for the column, and you'll only index the unfinished rows, so that's a small fraction of the jobs and probably stays pretty constant. The FBI will need 7 bytes for EVERY ROW, whether you're trying to find it or not. That index will keep pace with the size of the table, not just the size of the unfinished jobs.
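A sketch of that Y-or-NULL pattern (Oracle; the column and index names are mine):
ALTER TABLE workflow ADD (is_unfinished CHAR(1)); -- 'Y' when unfinished, NULL when done
CREATE INDEX idx_wf_unfinished ON workflow (is_unfinished); -- only the 'Y' rows are stored
-- finding 'work to do' uses the compact index:
SELECT id FROM workflow WHERE is_unfinished = 'Y';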
Reply to the comment by OMG
In his/her comment he/she states that to find unfinished jobs you'd just use
WHERE date_finished IS NULL
But in his answer he says
You can create a function based index in order to get around the NULL values not being indexed
If you follow the link he points you toward, it uses NVL to replace NULL values with some other arbitrary value - so I'm not sure what else there is to explain.
Is it a good or bad idea? I've heard you can't have an index on a column with NULL values, so "where date_finished is not null" will be very slow on big tables.
Oracle does index nullable fields, but does not index NULL values
This means that you can create an index on a nullable field, but the records holding NULL in this field won't make it into the index.
This, in turn, means that if you make date_finished NULL, the index will be smaller, as the NULL values won't be stored in the index.
So queries involving equality or range searches on date_finished will in fact perform better.
The downside of this solution, of course, is that queries involving the NULL values of date_finished will have to revert to a full table scan.
You can work around this by creating two indexes:
CREATE INDEX ix_finished ON mytable (date_finished)
CREATE INDEX ix_unfinished ON mytable (DECODE(date_finished, NULL, 1))
and use this query to find unfinished work:
SELECT *
FROM mytable
WHERE DECODE(date_finished, NULL, 1) = 1
This will behave like a partitioned index: the completed workflows will be indexed by the first index; the incomplete ones by the second.
If you don't need to search for complete or incomplete workflows, you can always get rid of the corresponding index.
In terms of table design, I think it's good that you removed the is_finished column as you said that it isn't necessary (it's redundant). There's no need to store extra data if it isn't necessary, it just wastes space. In terms of performance, I don't see this being a problem for NULL values. They should be ignored.
I would use NULLs, as indexes work (as already mentioned in other answers) for all queries apart from "WHERE date_finished IS NULL" (so it depends on whether you need that query). I definitely wouldn't use outliers like year 9999 as suggested by the answer:
you could also use a "dummy" value (such as 31 December 9999) as the date_finished value for unfinished workflows
Outliers like year 9999 affect performance, because (from http://richardfoote.wordpress.com/2007/12/13/outlier-values-an-enemy-of-the-index/):
The selectivity of a range scan is basically calculated by the CBO to be the number of values in the range of interest divided by the full range of possible values (IE. the max value minus the min value)
If you use a value like 9999, then the DB will think the range of values being stored in the field is e.g. 2008-9999 rather than the actual 2008-2010; so any range query (e.g. "between 2008 and 2009") will appear to cover a tiny % of the range of possible values, vs. actually covering about half the range. The CBO uses this statistic to say: if the % of the possible values covered is high, probably a lot of rows will match, and then a full table scan will be faster than an index scan. It won't reason correctly if there are outliers in the data.
Good idea to remove the derivable value column, as others have said.
One more thought is that by removing the column, you will avoid paradoxical conditions that you would need to code around, such as what happens when is_finished = No and date_finished = yesterday... etc.
To resolve the indexed vs. non-indexed column problem, wouldn't it be easier to simply JOIN two tables, like this:
-- PostgreSQL
CREATE TABLE workflow(
id SERIAL PRIMARY KEY
, name VARCHAR(100) NOT NULL
);
CREATE TABLE workflow_finished(
id INT NOT NULL PRIMARY KEY REFERENCES workflow
, date_finished date NOT NULL
);
Thus, if a record exists in workflow_finished, this workflow's completed, else it isn't. It seems to me this is rather simple.
When querying for unfinished workflows, the query becomes:
-- Only unfinished workflow items
SELECT workflow.id
FROM workflow
WHERE NOT EXISTS(
SELECT 1
FROM workflow_finished
WHERE workflow_finished.id = workflow.id);
Maybe you want the original query? With a flag and the date? Query like this then:
-- All items, with the flag and date
SELECT
workflow.id
, CASE
WHEN workflow_finished.id IS NULL THEN 'f'
ELSE 't'
END AS is_finished
, workflow_finished.date_finished
FROM
workflow
LEFT JOIN workflow_finished USING(id);
For consumers of the data, views can and should be created for their needs.
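For instance, a view that reconstructs the original flag-and-date shape for consumers (same schema as above; the view name is mine):
-- PostgreSQL
CREATE VIEW workflow_status AS
SELECT
workflow.id
, workflow.name
, workflow_finished.id IS NOT NULL AS is_finished
, workflow_finished.date_finished
FROM workflow
LEFT JOIN workflow_finished USING (id);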
As an alternative to a function-based index, you could also use a "dummy" value (such as 31 December 9999, or alternatively one day before the earliest expected date_finished value) as the date_finished value for unfinished workflows.
EDIT: Alternative dummy date value, following comments.
I prefer the single-column solution.
However, in the databases I use most often NULLs are included in indexes, so your common case of searching for open workflows will be fast whereas in your case it will be slower. Because the case of searching for open workflows is likely to be one of the most common things you do, you may need the redundant column simply to support that search.
Test for performance to see if you can use the better solution performance-wise, then fall back to the less-good solution if necessary.

How to know when to use indexes and which type?

I've searched a bit and didn't see any similar question, so here goes.
How do you know when to put an index in a table? How do you decide which columns to include in the index? When should a clustered index be used?
Can an index ever slow down the performance of select statements? How many indexes is too many and how big of a table do you need for it to benefit from an index?
EDIT:
What about column data types? Is it ok to have an index on a varchar or datetime?
Well, the first question is easy:
When should a clustered index be used?
Always. Period. Except for a very few, rare, edge cases. A clustered index makes a table faster, for every operation. YES! It does. See Kim Tripp's excellent The Clustered Index Debate continues for background info. She also mentions her main criteria for a clustered index:
narrow
static (never changes)
unique
if ever possible: ever increasing
INT IDENTITY fulfills this perfectly - GUID's do not. See GUID's as Primary Key for extensive background info.
Why narrow? Because the clustering key is added to each and every index page of each and every non-clustered index on the same table (in order to be able to actually look up the data row, if needed). You don't want to have VARCHAR(200) in your clustering key....
Why unique?? See above - the clustering key is the item and mechanism that SQL Server uses to uniquely find a data row. It has to be unique. If you pick a non-unique clustering key, SQL Server itself will add a 4-byte uniqueifier to your keys. Be careful of that!
Next: non-clustered indices. Basically there's one rule: any foreign key in a child table referencing another table should be indexed; it'll speed up JOINs and other operations.
Furthermore, any queries that have WHERE clauses are good candidates - pick the ones that are executed a lot first. Put indices on columns that show up in WHERE clauses and in ORDER BY statements.
Next: measure your system, check the DMV's (dynamic management views) for hints about unused or missing indices, and tweak your system over and over again. It's an ongoing process, you'll never be done! See here for info on those two DMV's (missing and unused indices).
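A common starting point for the missing-index DMVs looks something like this (a sketch; treat the suggestions critically rather than creating them blindly):
SELECT TOP (20)
d.statement AS table_name,
d.equality_columns,
d.inequality_columns,
d.included_columns,
s.user_seeks,
s.avg_user_impact
FROM sys.dm_db_missing_index_details AS d
JOIN sys.dm_db_missing_index_groups AS g ON g.index_handle = d.index_handle
JOIN sys.dm_db_missing_index_group_stats AS s ON s.group_handle = g.index_group_handle
ORDER BY s.user_seeks * s.avg_user_impact DESC;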
Another word of warning: with a truckload of indices, you can make any SELECT query go really really fast. But at the same time, INSERTs, UPDATEs and DELETEs which have to update all the indices involved might suffer. If you only ever SELECT - go nuts! Otherwise, it's a fine and delicate balancing act. You can always tweak a single query beyond belief - but the rest of your system might suffer in doing so. Don't over-index your database! Put a few good indices in place, check and observe how the system behaves, and then maybe add another one or two, and again: observe how the total system performance is affected by that.
Rule of thumb is primary key (implied and defaults to clustered) and each foreign key column
There is more but you could do worse than using SQL Server's missing index DMVs
An index may slow down a SELECT if the optimiser makes a bad choice, and it is possible to have too many. Too many will slow writes but it's also possible to overlap indexes
Answering the ones I can, I would say that every table, no matter how small, will always benefit from at least one index, as there has to be at least one way in which you are interested in looking up the data; otherwise why store it?
A general rule for adding indexes would be if you need to find data in the table using a particular field, or set of fields. This leads on to how many indexes are too many, generally the more indexes you have the slower inserts and updates will be as they also have to modify the indexes but it all depends on how you use your data. If you need fast inserts then don't use too many. In reporting "read only" type data stores you can have a number of them to make all your lookups faster.
Unfortunately there is no one rule to guide you on the number or type of indexes to use, although the query optimiser of your chosen DB can give hints based on the queries you are executing.
As to clustered indexes they are the Ace card you only get to use once, so choose carefully. It's worth calculating the selectivity of the field you are thinking of putting it on as it can be wasted to put it on something like a boolean field (contrived example) as the selectivity of the data is very low.
This is really a very involved question, though a good starting place would be to index any column that you will filter results on, i.e. if you often break products into groups by sale price, index the sale_price column of the products table to improve scan times for that query, etc.
If you are querying based on the value in a column, you probably want to index that column.
i.e.
SELECT a,b,c FROM MyTable WHERE x = 1
You would want an index on X.
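Concretely, using the example's names, that index - with the selected columns included so the query never touches the base table - might be:
CREATE INDEX IX_MyTable_x ON MyTable (x) INCLUDE (a, b, c);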
Generally, I add indexes for columns which are frequently queried, and I add compound indexes when I'm querying on more than one column.
Indexes won't hurt the performance of a SELECT, but they may slow down INSERTs (or UPDATEs) if you have too many indexed columns per table.
As a rule of thumb - start off by adding indexes when you find yourself saying WHERE a = 123 (in this case, an index for "a").
You should use an index on columns that you use for selection and ordering - i.e. the WHERE and ORDER BY clauses.
Indexes can slow down select statements if there are many of them and you are using WHERE and ORDER BY on columns that have not been indexed.
As for size of table - several thousands rows and upwards would start showing real benefits to index usage.
Having said that, there are automated tools to do this, and SQL Server has a Database Tuning Advisor that will help with this.