Index on multiple bit fields in SQL Server - sql

We currently have a scenario where one table effectively has several (10 to 15) boolean flags (not nullable bit fields). Unfortunately, it is not really possible to simplify this too much on a logical level, because any combination of the boolean values is permissible.
The table in question is a transactional table which may end up having tens of millions of rows, and both insert and select performance is fairly critical. Although we are not quite sure of the distribution of the data at this time, the combination of all flags should provide relatively good cardinality, i.e. make it a "worthwhile" index for SQL Server to make use of.
Typical select query scenarios might be to select records based on 3 or 4 of the flags only, e.g. WHERE FLAG3=1 AND FLAG7=0 AND FLAG9=1. It would not be practical to create separate indexes for all combinations of the flags used by these select queries, as there will be many of them.
Given this situation, what would be the recommended approach to effectively index these fields? The table is new, so there is no existing data to worry about yet, and we have a fair amount of flexibility in the actual implementation of the table.
There are two main options that we are considering at the moment:
Create a single index which includes all the bit fields (this would probably include 1 or 2 other int fields which would always be used). My concern is that, given the typical usage of only including a few of the fields, this approach would skip the index and resort to a table scan. Let's call this Option A. (Having read some of the replies, it seems that this approach would not work well, since the order of the fields in the index matters, making it impossible to index effectively on ALL the fields.)
Effectively do what I believe SQL Server is doing internally, and encode the bit fields into a single int field using binary operators (AND-ing and OR-ing numbers together: 1, 2, 4, 8, etc.). My concern here is that we'd need to do some kind of calculation to query on this encoded field, which would skip the index again. Maintenance and the complexity of this solution are also concerns. Let's call this Option B. Additional info: The argument for this approach is that we could have a relatively simple and short index which includes one or two other fields from the table and this field. The other fields would narrow down the number of records needing to be evaluated, and since the encoded field would contain all of our bit fields, SQL Server would be able to perform the calculation using the data retrieved from the index directly (i.e. an index scan) as opposed to the table (i.e. a table scan).
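For illustration, a rough sketch of what Option B could look like (the table and column names here are hypothetical, and the bit values assume Flag1 = 1, Flag2 = 2, Flag3 = 4, and so on):

-- Hypothetical table: OtherIntField stands in for the "1 or 2 other int fields"
CREATE TABLE dbo.TransactionData (
    Id bigint IDENTITY(1,1) NOT NULL PRIMARY KEY,
    OtherIntField int NOT NULL,
    FlagBits int NOT NULL DEFAULT 0
);

CREATE NONCLUSTERED INDEX IX_TransactionData_Other_Flags
    ON dbo.TransactionData (OtherIntField)
    INCLUDE (FlagBits);

-- FLAG3=1 AND FLAG7=0 AND FLAG9=1:
-- mask = 4 + 64 + 256 = 324, required value = 4 + 0 + 256 = 260
SELECT Id
FROM dbo.TransactionData
WHERE OtherIntField = 42
  AND FlagBits & 324 = 260;

The bitwise predicate is not SARGable, so it cannot drive an index seek by itself; the idea is only that it can be evaluated against the narrower index rows once OtherIntField has done the seeking.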
At the moment, we are heavily leaning towards Option B. For completeness, this would be running on SQL Server 2008.
Any advice would be greatly appreciated.
Edit: Spelling, clarity, query example, additional info on Option B.

A single BIT column typically is not selective enough to be even considered for use in an index. So an index on a single BIT column really doesn't make sense - on average, you'd always have to search about half the entries in the table (50% selectivity) and so the SQL Server query optimizer will instead use a table scan.
If you create a single index on all 15 bit columns, then you don't have that problem - since you have 15 yes/no options, your index will become quite selective.
Trouble is: the sequence of the bit columns is important. Your index will only ever be considered if your SQL statement filters on a left-most prefix of the indexed columns (the first column, then the first two, and so on).
So if your index is on
Col1,Col2,Col3,....,Col14,Col15
then it might be used for a query that uses
Col1
Col1 and Col2
Col1 and Col2 and Col3
....
and so on. But it cannot be used for a query that specifies Col6,Col9 and Col14.
Because of that, I don't really think an index on your collection of BIT columns really makes a lot of sense.
Are those 15 BIT columns the only columns you use for querying? If not, I would try to combine those BIT columns that you use most for selection with other columns, e.g. have an index on Name and Col7 or something (then your BIT columns can add some additional selectivity to another index)
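As a rough sketch of that last suggestion (the table and column names are made up, following the example in the text):

-- Name narrows the result set first; the most-used flags add a little extra selectivity
CREATE NONCLUSTERED INDEX IX_MyTable_Name_Flags
    ON dbo.MyTable (Name, Col7, Col3);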

Whilst there are probably ways to solve your indexing problem against your existing table schema, I would reduce this to a normalisation problem:
e.g. I would highly recommend creating a series of new tables:
A lookup table for the names of these bit flags, e.g. CREATE TABLE Flags (id int IDENTITY(1,1), Name varchar(256)) (you don't have to make id an identity-seed column if you want to manually control the id's - e.g. 2,4,8,16,32,64,128 as binary flags.)
A new link table that contains the id's of the original data table and of the Flags table, e.g. CREATE TABLE DataFlags_Link (id int IDENTITY(1,1), MyFlagId int, DataId int)
You could then create an index on the DataFlags_Link table and write queries like:
SELECT Data.*
FROM Data
INNER JOIN DataFlags_Link ON Data.id = DataFlags_Link.DataId
WHERE DataFlags_Link.MyFlagId IN (4,7,2,8)
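For example, one plausible index for that join and filter might be (just a sketch, not the only reasonable choice):

CREATE NONCLUSTERED INDEX IX_DataFlags_Link_Flag_Data
    ON DataFlags_Link (MyFlagId, DataId);

With MyFlagId leading, the IN filter becomes a set of index seeks, and DataId is right there for the join back to Data.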
As for performance, that's where good DBA maintenance comes in. You'll want to set the INDEX fill-factor and padding on your tables appropriately and run regular index defragmentation or rebuild your indexes on a schedule.
Performance and maintenance go hand-in-hand with databases. You can't have one without the other.
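For instance, a scheduled maintenance job might run something along these lines (the fill factor of 90 is just an example value; the right number depends on your insert pattern):

-- rebuild, leaving some free space on each page for future inserts
ALTER INDEX ALL ON DataFlags_Link REBUILD WITH (FILLFACTOR = 90, PAD_INDEX = ON);

-- or, for lightly fragmented indexes, reorganize instead
ALTER INDEX IX_DataFlags_Link_Flag_Data ON DataFlags_Link REORGANIZE;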

Whilst I think Neil Fenwick's answer is probably right, I think the real answer is to try out the different options and see which one is fast enough.
Option A is probably the most straightforward solution, and therefore likely the most maintainable - and it may well be fast enough.
I would build a prototype database, with the "Option A" schema, and use something like http://www.red-gate.com/products/sql-development/sql-data-generator/ or http://sourceforge.net/projects/dbmonster/ to create twice as much data as you expect to need, and then build the queries you expect to need. Agree on an acceptable response time, and only consider a "faster" schema if you exceed those response times (and you can't throw hardware at the problem).
Neil's solution is probably just as obvious and maintainable as Option A - and it should be easy to index. However, I'd still test it by creating a prototype schema and generating a lot of test data...

Related

DB Architecture: One table using WHERE vs multiple

I wonder what the difference is between having one table with 6 million rows (i.e. a fairly large DB) and 100k active users:
CREATE TABLE shoes (
id serial primary key,
color text,
is_left_one boolean,
stock int
);
Along with 6 indexes like:
CREATE INDEX blue_left_shoes ON shoes(color,is_left_one) WHERE color='blue' AND is_left_one=true;
Versus: 6 tables with 1 million rows:
CREATE TABLE blue_left_shoes(
id serial primary key,
stock int
);
The latter one seems more efficient because users don't have to ask for the condition since the table IS the condition, but perhaps creating the indexes mitigates this?
This table is used to query either left or right shoes in "blue", "green" or "red", and to check the number of remaining items. It is a simplified example, but you can think of Amazon's (or any digital selling platform's) "only 3 items left in stock" tooltip for the workload and the use case. It is the users (100k active daily) who will make the query.
NB: The question is mostly for PostgreSQL but differences with other DB is still relevant and interesting.
In the latter case, where you use a table called blue_left_shoes:
Your code needs to first work out which table to look at (as opposed to parameterising a value in the where clause)
As permutations and options increase, you need to increase the number of tables, and increase the logic in your app that works out which table to use
Anything that needs to use this database (i.e. a reporting tool or an API) now needs to re implement all of these rules
You are imposing logic at a high layer to improve performance.
If you were to partition and/or index your table appropriately, you get the same effect - SQL queries only look through the records that matter. The difference is that you don't need to implement this logic in higher layers
As long as you can get the indexing right, keeping this as one table is almost always the right thing to do.
Partitioning
Database partitioning is where you select one or more columns to decide how to "split up" your table. In your case you could choose (color, is_left_one).
Now your table is logically split and ordered in this way and when you search for blue,true it automatically knows which partition to look in. It doesn't look in any other partitions (this is called partition pruning)
Note that this occurs automatically from the search criteria. You don't need to manually work out a particular table to look at.
Partitioning doesn't require any extra storage (beyond various metadata that has to be saved)
You can't apply multiple partitioning schemes to a table - only one.
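As a sketch, declarative partitioning in PostgreSQL (10 or later) could look like this for the shoes example; it partitions on colour alone to keep things simple:

CREATE TABLE shoes (
    id bigserial,
    color text NOT NULL,
    is_left_one boolean,
    stock int
) PARTITION BY LIST (color);
-- note: a primary key on a partitioned table must include the partition key (PostgreSQL 11+)

CREATE TABLE shoes_blue  PARTITION OF shoes FOR VALUES IN ('blue');
CREATE TABLE shoes_green PARTITION OF shoes FOR VALUES IN ('green');
CREATE TABLE shoes_red   PARTITION OF shoes FOR VALUES IN ('red');

-- partition pruning: only shoes_blue is scanned
SELECT stock FROM shoes WHERE color = 'blue' AND is_left_one = true;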
Indexing
Creating an index also provides performance improvements. However, indexes take up space and can impact insert and update performance (as they need to be maintained). Practically speaking, the benefit to SELECTs almost always far outweighs any insert/update negatives.
You should always look at indexes before partitioning
Non selective indexes
In your particular case, there's an extra thing to consider: a boolean field is not "selective". I won't go into details, but suffice to say you shouldn't create an index on this field alone, as it won't be used - it only halves the number of records you have to look through. You'd need to include some other fields in any index (e.g. colour) to make it useful.
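Concretely, for the shoes table that usually means something like this (a sketch; a composite index that leads with the more selective column):

CREATE INDEX idx_shoes_color_side ON shoes (color, is_left_one);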
In general, you want to keep all "like" data in a single table, not split among multiples. There are good reasons for this:
Adding new combinations is easier.
Maintaining the tables is easier.
You can easily do queries "across" entities.
Overall, the database is more efficient, because it is more likely that pages will be filled.
And there are other reasons as well. In your case, you might have an argument for breaking the data into 6 separate tables. The gain here comes from not having the color and is_left_one in the data. That means that this data is not repeated 6 million times. And that could save many tens of megabytes of data storage.
I say the last a bit tongue-in-cheek (meaning I'm not that serious). Computers nowadays have so much memory that 100 Mbytes is just not significant in general. However, if you have a severely memory-limited environment (I'm thinking "watch" here, not even "smart phone") then it might be useful.
Otherwise, partitioning is a fine solution that pretty much meets your needs.
For this:
WHERE color='blue' AND is_left_one=true
The optimal index is
INDEX(color, is_left_one) -- in either order
Having id first makes it useless for that WHERE.
It is generally bad to have multiple identical tables instead of one.

SQL Server table optimal indexing

I have a very specific question, this is part of a job interview test.
I have this table:
CREATE TABLE Teszt
(
Id INT NOT NULL
, Name NVARCHAR(100)
, [Description] NVARCHAR(MAX)
, Value DECIMAL(20,4)
, IsEnabled BIT
)
And these selects:
SELECT Name
FROM Teszt
WHERE Id = 10
SELECT Id, Value
FROM Teszt
WHERE IsEnabled = 1
SELECT [Description]
FROM Teszt
WHERE Name LIKE '%alma%'
SELECT [Description]
FROM Teszt
WHERE Value > 1000 AND IsEnabled = 1
SELECT Id, Name
FROM Teszt
WHERE IsEnabled = 1
The question is, where on this table should I put indexes to optimize the performance of the above queries. No other info on the table was provided, so my answer will contain the general pro/contra arguments for indexes, but I'm not sure regarding the above queries.
My thoughts on optimizing these specific queries with indexes:
Id should probably have an index; it looks like the primary key and it is part of a where clause;
Creating one on the Value column would also be good, as it's part of a where clause here;
Now it gets murky for me. For the Name column, based on just the above queries, I probably shouldn't create one, as it is used with LIKE, which defeats the purpose of an index, right?
I tried to read everything on indexing a bit column (the IsEnabled column in the table), but I can't say it's any clearer to me, as the arguments range wildly. Should I create an index on it? Should it be filtered? Should it be a separate index, or part of one with the other columns?
Again, this is all theoretical, so no info on the size or the usage of the table.
Thanks in advance for any answer!
Regards,
Tom
An index on a bit column is generally not recommended. The following discussion applies not only to bit columns but to any low-cardinality value. In English, "low-cardinality" means the column takes on only a handful of values.
The reason is simple. A bit column takes on three values (if you include NULL). That means that a typical selection on the column would return about a third of the rows. A third of the rows means that you would (typically) be accessing every data page. If so, you might as well do a full table scan.
So, let's ask the question explicitly: When is an index on a bit column useful or appropriate?
First, the above argument does not work if you are always looking for IsEnabled = 1 and, say, 0.001% of the rows are enabled. This is a highly selective query and an index could help. Note: The index would not help on IsEnabled = 0 in this scenario.
Second, the above argument argues in favor of a clustered index on the bit value. If the values are clustered, then even a 30% selectivity means that you are only reading 30% of the rows. The downside is that updating the value means moving the record from one data page to another (a somewhat expensive operation).
Third, a bit column can constructively be part of a larger index. This is especially true of a clustered index with the bit being first. For instance, for the fourth query, one could argue that a clustered index on (IsEnabled, Value) would be the optimal index (a clustered index covers every column of the row, including Description, which as an NVARCHAR(MAX) column could not be an index key column anyway).
To be honest, though, I don't like playing around with clustered indexes. I prefer that the primary key be the clustered index. I admit that performance gains can be impressive for a narrow set of queries -- and if this is your use case, then use them (and accessing enabled rows might be a good reason for using them). However, the clustered index is something that you get to use only once, and primary keys are the best generic option to optimize joins.
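To make the first case concrete: since SQL Server 2008 you can also express this as a filtered index, which only stores the (rare) enabled rows. A sketch against the table above, assuming queries only ever look for IsEnabled = 1:

CREATE NONCLUSTERED INDEX IX_Teszt_Enabled
    ON dbo.Teszt (Id)
    INCLUDE (Name, Value)
    WHERE IsEnabled = 1;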
You can read the details about how to create an index in this article: https://msdn.microsoft.com/en-us/library/ms188783.aspx
As you said, there are pros and cons to using indexes.
Pros: SELECT queries will be faster.
Cons: INSERT queries will be slower.
Conclusion: add indexes if your table sees relatively few INSERTs and mostly SELECTs.
In which columns should I consider adding an index? This is really a very good question. Though I am not a DB expert, here are my views:
Add index on your primary key column
Add index on your join column [inner/outer/left]
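For the table in the question that boils down to something like this (assuming Id really is meant to be the key):

ALTER TABLE Teszt ADD CONSTRAINT PK_Teszt PRIMARY KEY CLUSTERED (Id);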
Short answer: on Id and IsEnabled
(despite the controversy about indexing a BIT field; and Id should be the primary key)
Generally, to optimize performance, indexes should go on fields that appear in WHERE clauses or JOINs. Under the hood, the db server looks for a suitable index to satisfy the query; if none is found, it has to scan the data (or build a temporary worktable or index spool), which takes time, hence the performance degradation.
As Bhuwan noted, indexes are "bad" for INSERTs (keep that in mind for the whole picture when designing a database), but there are only SELECTs in the example provided.
Hope you passed the test :)
-Nick
tldr: I will probably delete this later, so no need!
My answer to this job interview question: "It depends." ... and then I would probably spend too much of the interview talking about how terrible the question is.
The problem is that this is simply a bad question for a "job interview test". I have been poking at this for two hours now, and the longer I spend the more annoyed I get.
With absolutely no information on the content of the table, we cannot guarantee that this table is even in first normal form or better, so we cannot even assume that the only non-nullable column, Id, is a valid primary key.
With no idea about the content of the table, we do not even know if it needs indexes. If it has only a few rows, then the entire page will sit in memory and whichever operations you are running against it will be fast enough.
With no cardinality information we do not know if a value > 1000 is common or uncommon. All or none of the values could be greater than 1000, but we do not know.
With no cardinality information we do not know if IsEnabled = 1 would mean 99% of rows, or even 0% of rows.
I would say you are on the right track as far as your thought process for evaluating indexing, but the trick is that you are drawing from your experiences with indexes you needed on tables before this table. Applying assumptions based on general previous experience is fine, but you should always test them. In this case, blindly applying general practices could be a mistake.
The question is, where on this table should I put indexes to optimize the performance of the above queries. No other info on the table was provided
If I try to approach this from another position: Nothing else matters except the performance of these five queries, I would apply these indexes:
create index ixf_Name on dbo.Teszt(Name)
include (Id)
where id = 10;
create index ixf_Value_Enabled on dbo.Teszt(Value)
include (Id)
where IsEnabled = 1;
create index ixf_Value_gt1k_Enabled on dbo.Teszt(Id)
include (description,value,IsEnabled)
where Value > 1000 and IsEnabled = 1;
create index ixf_Name_Enabled on dbo.Teszt(Id)
include (Name, IsEnabled)
where IsEnabled = 1;
create index ixf_Name_notNull on dbo.Teszt(Name)
include (Description)
where Name is not null;
Also, the decimal(20,4) annoys me because this is the least amount of data you can store in the 13 bytes of space it takes up. decimal(28,4) has the same storage size and if it could have been decimal(19,4) then it would have been only 9 bytes. Granted this is a silly thing to be annoyed about, especially considering the table is going to be wide anyway, but I thought I would point it out anyway.
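You can check those storage sizes yourself with DATALENGTH:

SELECT DATALENGTH(CAST(0 AS decimal(19,4))) AS bytes_19_4,  -- 9
       DATALENGTH(CAST(0 AS decimal(20,4))) AS bytes_20_4,  -- 13
       DATALENGTH(CAST(0 AS decimal(28,4))) AS bytes_28_4;  -- 13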

Table index design

I would like to add index(es) to my table.
I am looking for general ideas on how to add more indexes to a table.
Other than the PK clustered.
I would like to know what to look for when I am doing this.
So, my example:
This table (let's call it TASK table) is going to be the biggest table of the whole application. Expecting millions records.
IMPORTANT: massive bulk-insert is adding data in this table
table has 27 columns: (so far, and counting :D )
int x 9 columns = id-s
varchar x 10 columns
bit x 2 columns
datetime x 5 columns
INT COLUMNS
all of these are INT ID-s but from tables that are usually smaller than Task table (10-50 records max), example: Status table (with values like "open", "closed") or Priority table (with values like "important", "not so important", "normal")
there is also a column like "parent-ID" (self - ID)
join: all the "small" tables have PK, the usual way ... clustered
STRING COLUMNS
there is a Company column (a string!) that is always something like 5 characters long, and every user will be restricted by it. If there are 15 different "Companies" in Task, the logged-in user would only see one. So there's always a filter on this column. Might it be a good idea to add an index to it?
DATE COLUMNS
I think these usually aren't indexed ... right? Or can/should they be?
I wouldn't add any indices - unless you have specific reasons to do so, e.g. performance issues.
In order to figure out what kind of indices to add, you need to know:
what kind of queries are being used against your table - what are the WHERE clauses, what kind of ORDER BY are you doing?
how is your data distributed? Which columns are selective enough (< 2% of the data) to be useful for indexing
what kind of (negative) impact do additional indices have on your INSERTs and UPDATEs on the table
any foreign key columns should be part of an index - preferably as the first column of the index - to speed up JOINs to other tables
And sure you can index a DATETIME column - what made you think you cannot?? If you have a lot of queries that will restrict their result set by means of a date range, it can make total sense to index a DATETIME column - maybe not by itself, but in a compound index together with other elements of your table.
What you cannot index are columns that hold more than 900 bytes of data - anything like VARCHAR(1000) or such.
For great in-depth and very knowledgeable background on indexing, consult the blog by Kimberly Tripp, Queen of Indexing.
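For instance, if Task queries always filter on the company and often restrict by a date range, a compound index along these lines might be worth testing (the column names here are made up, since the actual schema wasn't posted):

CREATE NONCLUSTERED INDEX IX_Task_Company_Created
    ON dbo.Task (CompanyCode, CreatedDate);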
In general an index will speed up a JOIN, a sort operation and a filter.
So if the columns are in the JOIN, the ORDER BY or the WHERE clause, then an index will help in terms of performance... but there is always a but... with every index that you add, UPDATE, DELETE and INSERT operations will be slowed down because the indexes have to be maintained.
so the answer is...it depends
I would say start hitting the table with queries and look at the execution plans for scans, try to make those seeks by either writing SARGable queries or adding indexes if needed...don't just add indexes for the sake of adding indexes
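As a quick illustration of what "SARGable" means here (CreatedDate is a hypothetical column):

-- SARGable: the predicate is on the bare column, so an index on CreatedDate can be used for a seek
SELECT COUNT(*) FROM dbo.Task
WHERE CreatedDate >= '20240101' AND CreatedDate < '20240201';

-- not SARGable: wrapping the column in functions forces a scan
SELECT COUNT(*) FROM dbo.Task
WHERE YEAR(CreatedDate) = 2024 AND MONTH(CreatedDate) = 1;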
Step one is to understand how the data in the table will be used: how will it be inserted, selected, updated, deleted. Without knowing your usage patterns, you're shooting in the dark. (Note also that whatever you come up with now, you may be wrong. Be sure to compare your decisions with actual usage patterns once you're up and running.) Some ideas:
If users will often be looking up individual items in the table, an index on the primary key is critical.
If data will be inserted with great frequency and you have multiple indexes, over time you will have to deal with index fragmentation. Read up on and understand clustered and non-clustered indexes and fragmentation (ALTER INDEX...REBUILD).
But, if performance is key in situations where you need to retrieve a lot of rows, you might consider using your clustered index to support that.
If you often want a set of data based on Status, indexing on that column can be good--particularly if 1% of your rows are "Active" vs. 99% "Not Active", and all you want are the active ones.
Conversely, if your "PriorityId" is only used to get the "label" stating what PriorityId 42 is (i.e. join into the lookup table), you probably don't need an index on it in your main table.
A last idea, if everyone will always retrieve data for only one Company at a time, then (a) you'll definitely want to index on that, and (b) you might want to consider partitioning the table on that value, as it can act as a "built in filter" above and beyond conventional indexing. (This is perhaps a bit extreme and it's only available in Enterprise edition, but it may be worth it in your case.)
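A rough sketch of what that could look like in SQL Server (the boundary values and column names are invented; a real design would also need a filegroup strategy and has to account for any existing clustered key):

CREATE PARTITION FUNCTION pf_Company (char(5))
    AS RANGE RIGHT FOR VALUES ('COMPB', 'COMPC', 'COMPD');

CREATE PARTITION SCHEME ps_Company
    AS PARTITION pf_Company ALL TO ([PRIMARY]);

-- place the clustered index on the scheme so rows are physically grouped by company
CREATE CLUSTERED INDEX CIX_Task_Company
    ON dbo.Task (CompanyCode, TaskId)
    ON ps_Company (CompanyCode);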

effect of number of projections on query performance

I am looking to improve the performance of a query which selects several columns from a table. I was wondering if limiting the number of columns would have any effect on the performance of the query.
Reducing the number of columns would, I think, have only very limited effect on the speed of the query but would have a potentially larger effect on the transfer speed of the data. The less data you select, the less data that would need to be transferred over the wire to your application.
I might be misunderstanding the question, but here goes anyway:
The absolute number of columns you select doesn't make a huge difference. However, which columns you select can make a significant difference depending on how the table is indexed.
If you are selecting only columns that are covered by the index, then the DB engine can use just the index for the query without ever fetching table data. If you use even one column that's not covered, though, it has to fetch the entire row (key lookup) and this will degrade performance significantly. Sometimes it will kill performance so much that the DB engine opts to do a full scan instead of even bothering with the index; it depends on the number of rows being selected.
So, if by removing columns you are able to turn this into a covering query, then yes, it can improve performance. Otherwise, probably not. Not noticeably anyway.
Quick example for SQL Server 2005+ - let's say this is your table:
CREATE TABLE MyTable (
ID int IDENTITY(1,1) NOT NULL PRIMARY KEY CLUSTERED,
Name varchar(50) NOT NULL,
Status tinyint NOT NULL
)
If we create this index:
CREATE INDEX IX_MyTable
ON MyTable (Name)
Then this query will be fast:
SELECT ID
FROM MyTable
WHERE Name = 'Aaron'
But this query will be slow(er):
SELECT ID, Name, Status
FROM MyTable
WHERE Name = 'Aaron'
If we change the index to a covering index, i.e.
CREATE INDEX IX_MyTable
ON MyTable (Name)
INCLUDE (Status)
Then the second query becomes fast again because the DB engine never needs to read the row.
Limiting the number of columns has no measurable effect on the query. Almost universally, an entire row is fetched to cache. The projection happens last in the SQL pipeline.
The projection part of the processing must happen last (after GROUP BY, for instance) because it may involve creating aggregates. Also, many columns may be required for JOIN, WHERE and ORDER BY processing. More columns than are finally returned in the result set. It's hardly worth adding a step to the query plan to do projections to somehow save a little I/O.
Check your query plan documentation. There's no "project" node in the query plan. It's a small part of formulating the result set.
To get away from "whole row fetch", you have to go for a columnar ("Inverted") database.
It can depend on the server you're dealing with (and, in the case of MySQL, the storage engine). Just for example, there's at least one MySQL storage engine that does column-wise storage instead of row-wise storage, and in this case more columns really can take more time.
The other major possibility would be if you had segmented your table so some columns were stored on one server, and other columns on another (aka vertical partitioning). In this case, retrieving more columns might involve retrieving data from different servers, and it's always possible that the load is imbalanced so different servers have different response times. Of course, you usually try to keep the load reasonably balanced so that should be fairly unusual, but it's still possible (especially if, for example, if one of the servers handles some other data whose usage might vary independently from the rest).
Yes - if your query can be covered by a non-clustered index it will be faster, since all the data is already in the index and the base table (if you have a heap) or clustered index does not need to be touched by the query engine.
To demonstrate what tvanfosson has already written - that there is a "transfer" cost - I ran the following two statements on an MSSQL 2000 DB from Query Analyzer.
SELECT datalength(text) FROM syscomments
SELECT text FROM syscomments
Both results returned 947 rows but the first one took 5 ms and the second 973 ms.
Also, because the fields are the same, I would not expect indexing to be a factor here.

SQL `LIKE` complexity

Does anyone know what the complexity is for the SQL LIKE operator for the most popular databases?
Let's consider the three core cases separately. This discussion is MySQL-specific, but might also apply to other DBMS due to the fact that indexes are typically implemented in a similar manner.
LIKE 'foo%' is quick if run on an indexed column. MySQL indexes are a variation of B-trees, so when performing this query it can simply descend the tree to the node corresponding to foo, or the first node with that prefix, and traverse the tree forward. All of this is very efficient.
LIKE '%foo' can't be accelerated by indexes and will result in a full table scan. If you have other criteria that can be evaluated using indexes, it will only scan the rows that remain after the initial filtering.
There's a trick though: If you need to do suffix matching - searching for file names with extension .foo, for instance - you can achieve the same performance by adding a column with the same contents as the original one but with the characters in reverse order.
ALTER TABLE my_table ADD COLUMN col_reverse VARCHAR (256) NOT NULL;
ALTER TABLE my_table ADD INDEX idx_col_reverse (col_reverse);
UPDATE my_table SET col_reverse = REVERSE(col);
Searching for rows with col ending in .foo then becomes:
SELECT * FROM my_table WHERE col_reverse LIKE 'oof.%'
Finally, there's LIKE '%foo%', for which there are no shortcuts. If there are no other limiting criteria which reduce the number of rows to a feasible amount, it'll cause a hard performance hit. You might want to consider a full text search solution instead, or some other specialized solution.
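For example, in MySQL a basic full text setup looks roughly like this; note that full text search matches whole words rather than arbitrary substrings, so it is a different tool rather than a drop-in replacement for LIKE '%foo%':

ALTER TABLE my_table ADD FULLTEXT INDEX ft_col (col);

SELECT * FROM my_table
WHERE MATCH (col) AGAINST ('foo' IN NATURAL LANGUAGE MODE);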
If you are asking about the performance impact:
The problem with LIKE is that it can keep the database from using an index. On Oracle I think it doesn't use indexes at all in that case (but I'm still on Oracle 9). SQL Server uses indexes if the wildcard is only at the end. I don't know about other databases.
Depends on the RDBMS, the data (and possibly size of data), indexes and how the LIKE is used (with or without prefix wildcard)!
You are asking too general a question.