SQL Index Performance - sql

We have a table called table1 ...
(c1 int identity,c2 datetime not null,c3 varchar(50) not null,
c4 varchar(50) not null,c5 int not null,c6 int ,c7 int)
on column c1 is primary key(clustered Index)
on column c2 is index_2(Nonclustered)
on column c3 is index_2(Nonclustered)
on column c4 is index_2(Nonclustered)
on column c5 is index_2(Nonclustered)
It contains 10 million records. We have several procedures pointing to "table1" with different search criteria:
select from table1 where c1=blah..and c2= blah.. and c3=blah..
select from table1 where c2=blah..and c3= blah.. and c4=blah..
select from table1 where c1=blah..and c3= blah.. and c5=blah..
select from table1 where c1=blah..
select from table1 where c2=blah..
select from table1 where c3=blah..
select from table1 where c4=blah..
select from table1 where c5=blah..
What is the best way to create non-clustered index apart from above, or modify existing indexes to get good index performance and reduce the execution time?

And now to actually respond...
The trick here is that you have single-column lookups on any number of columns, as well as composite column lookups. You need to understand with what frequency the queries above are executing - for those that are run very seldom, you should probably exclude them from your indexing considerations.
You may be better off creating single NCIX's on each of the columns being queried. This would likely be the case if the number of rows being returned is very small, as the NCIX's would be able to handle the "single lookup" queries as well as the composite lookups. Alternatively, you could create single-column NCIX's in addition to covering composite indexes - again, the deciding factor being the frequency of execution and the number of results being returned.

This is somewhat tough to answer with just the information you have provided. There are other factors you need to weigh out.
For example:
How often is the table updated and what columns are updated frequently?
You'll be paying a cost on these updates due to index maintenance.
What is the cardinality of your different columns?
What queries are you executing most often and what columns appear in the where clause of those queries?
You need to first figure out what your parameters for acceptable performance are for each of your queries and work from there, taking into account the things I have mentioned above.
With 10 million rows, partitioning your table could make a lot of sense here.

Have you thought about using the Full Text Search component of MSSQL? It might offer the performance that you are looking for.

Related

db2 10.5 multi-column index explanation

This is my first time working with indexes in a database, and so far I've learned that if you have a multi-column index such as index('col1', 'col2', 'col3') and you run a query that uses where col2='col2' and col3='col3', that index would not be used.
I also learned that if a column has very low selectivity, indexing it is useless.
However, from my test, it seems none of the above is true at all. Can someone explain more on this?
I have a table with more than 16 million records. Let's say claimID is the primary key, then there's a historynumber column that has only 3 distinct values (1,2,3), and a last column, storeNumber, that has about 1 million distinct values.
I have an index for claimID alone, another index(historynumber, claimID), another index(historynumber, storeNumber), and finally index(storeNumber, historynumber).
My guess was that if I do:
select * from my_table where claimId='123456' and historynumber = 1
would be much faster than
select * from my_table where historynumber = 1 and claimId = '123456'
However, the 2 have exactly the same performance (instant). So I thought the primary key index can work on any column order. Therefore, I tried the same thing but on historynumber and storeNumber instead. The result is exactly the same. Then I started trying it out on columns that have no indexes and of course the result is the same also.
Finally, I do a
select * from my_table where historynumber = 1
and the query takes so long I had to cancel it.
So my conclusion is that the column order in where clause is completely useless, and so is the column order in the index definition since it seems like the database is smart enough to tell which column is the highest selectivity column.
Could someone give me an example that could prove otherwise?
Index explanation is a huge topic.
Don't worry about the sequence of different attributes in the SQL - it has no effect whether you specify
...where claimId='123456' and historynumber = 1
or the other way round. Each SQL statement is checked and optimized by the optimizer. To prove how the data gets accessed you could do an EXPLAIN. Check the documentation for more details.
For your other problem
select * from my_table where historynumber = 1
with an index of (storeNumber, historynumber).
Have you ever tried to lookup the name of a caller (having the telephone number) in a telephone book?
Well, it is pretty much the same for an index - so the column order when creating the index matters!
There are techniques which could help - i.e. index jump scan - but there is no guarantee.
Check out following sites to learn a little bit more about DB2 indexes:
http://db2commerce.com/2013/09/19/db2-luw-basics-indexes/
http://use-the-index-luke.com/sql/where-clause/the-equals-operator/concatenated-keys
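To see the telephone-book effect for yourself, here is a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in (an assumption made only so the example is runnable; DB2's EXPLAIN output looks different), and the index name idx_store_hist is made up for illustration. It compares the plan for a query that filters on the leading index column with one that filters only on the trailing column, and also checks that the order of predicates in the WHERE clause does not change the plan.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE my_table ("
    "claimId TEXT PRIMARY KEY, historynumber INTEGER, storeNumber INTEGER)"
)
# Hypothetical index name; the column order matches index(storeNumber, historynumber).
conn.execute("CREATE INDEX idx_store_hist ON my_table (storeNumber, historynumber)")


def plan(sql):
    """Return the 'detail' column of SQLite's EXPLAIN QUERY PLAN as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[3] for row in rows)


# Same predicates, written in a different order in the WHERE clause.
both_a = plan("SELECT * FROM my_table WHERE storeNumber = 42 AND historynumber = 1")
both_b = plan("SELECT * FROM my_table WHERE historynumber = 1 AND storeNumber = 42")

# Only the trailing column of the index is filtered.
trailing_only = plan("SELECT * FROM my_table WHERE historynumber = 1")

print(both_a)
print(both_b)
print(trailing_only)
```

In SQLite the first two plans are identical index searches, while the last one falls back to scanning the table, because the index is sorted by storeNumber first. Other engines may do better in some cases (for example with a jump scan), so treat this as an illustration of the principle rather than of DB2's exact behaviour.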

TSQL join performance

My problem is that this query takes forever to run:
Select
tableA.CUSTOMER_NAME,
tableB.CUSTOMER_NUMBER,
TableB.RuleID
FROM tableA
INNER JOIN tableB on tableA.CUST_PO_NUMBER like tableB.CustomerMask
Here is the structure of the tables:
CREATE TABLE [dbo].[TableA](
[CUSTOMER_NAME] [varchar](100) NULL,
[CUSTOMER_NUMBER] [varchar](50) NULL,
[CUST_PO_NUMBER] [varchar](50) NOT NULL,
[ORDER_NUMBER] [varchar](30) NOT NULL,
[ORDER_TYPE] [varchar](30) NULL)
CREATE TABLE [dbo].[TableB](
[RuleID] [varchar](50) NULL,
[CustomerMask] [varchar](500) NULL)
TableA has 14 million rows and TableB has 1000 rows. Data in column CustomerMask can be anything like '%', 'ttt%', '%ttt%', etc.
How can I tune it to make it faster?
Thanks!
The short answer is don't use the LIKE operator to join two tables containing millions of rows. It's not going to be fast, no matter how you tune it. You might be able to improve it incrementally, but it will just be putting lipstick on a pig.
You need to have a distinct value on which to join the tables. Right now it has to do a complete scan of tableA, and do an item-by-item wildcard comparison between CUST_PO_NUMBER and CustomerMask. You're looking at 14 billion comparisons (14 million rows times 1000 masks), all using the slow LIKE operator.
The only suggestion I can give is to re-think the architecture of associating rules with Customers.
While you can't change what's already there, you can create a new table like this:
CREATE TABLE [dbo].[TableC](
[CustomerMask] [varchar](500) NULL,
[CUST_PO_NUMBER] [varchar](50) NOT NULL)
Then have a trigger on both TableA and TableB that inserts, updates, or deletes records in TableC so that it always holds exactly the pairs that match the condition CUST_PO_NUMBER LIKE CustomerMask (for the trigger on TableB, you only need to update TableC if the CustomerMask field has been changed).
Then your query will just become:
SELECT
tableA.CUSTOMER_NAME,
tableB.CUSTOMER_NUMBER,
TableB.RuleID
FROM tableA
INNER JOIN tableC on tableA.CUST_PO_NUMBER = tableC.CUST_PO_NUMBER
INNER JOIN tableB on tableC.CustomerMask = tableB.CustomerMask
This will greatly improve your query performance and it shouldn't greatly affect your write performance. You will basically only be performing the like query once for each record (unless they change).
Just change the join order and it should be faster. Use this query:
Select tableA.CUSTOMER_NAME, tableB.CUSTOMER_NUMBER, TableB.RuleID
FROM tableB
INNER JOIN tableA
on tableB.CustomerMask like tableA.CUST_PO_NUMBER
Am I missing something? What about the following:
Select
tableA.CUSTOMER_NAME,
tableA.CUSTOMER_NUMBER,
tableB.RuleID
FROM tableA, tableB
WHERE tableA.CUST_PO_NUMBER = tableB.CustomerMask
EDIT2: Thinking about it, how many of those masks start and end with wildcards? You might gain some performance by first:
Indexing CUST_PO_NUMBER
Creating a persisted computed column CUST_PO_NUMBER_REV that's the reverse of CUST_PO_NUMBER
Indexing the persisted column
Putting statistics on these columns
Then you might build three queries, and UNION ALL the results together:
SELECT ...
FROM ...
ON CUST_PO_NUMBER LIKE CustomerMask
WHERE /* First character of CustomerMask is not a wildcard but last one is */
UNION ALL
SELECT ...
FROM ...
ON CUST_PO_NUMBER_REV LIKE REVERSE(CustomerMask)
WHERE /* Last character of CustomerMask is not a wildcard but first one is */
UNION ALL
SELECT ...
FROM ...
ON CUST_PO_NUMBER LIKE CustomerMask
WHERE /* Everything else */
That's just a quick example, you'll need to take some care that the WHERE clauses give you mutually exclusive results (or use UNION, but aim for mutually exclusive results first).
If you can do that, you should have two queries using index seeks and one query using index scans.
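One way to check that the three WHERE conditions really are mutually exclusive is to write the classification down as a function and run your 1000 masks through it. The sketch below is in Python rather than T-SQL so it can be run anywhere; classify_mask is a made-up helper, and it only treats % and _ as wildcards (bracket classes such as [a-z] and ESCAPE clauses are ignored, which is an assumption you would need to revisit for real data).

```python
def classify_mask(mask):
    """Assign a LIKE mask to exactly one of the three UNION ALL branches."""
    starts_wild = mask[:1] in ("%", "_")
    ends_wild = mask[-1:] in ("%", "_")
    if not starts_wild and ends_wild:
        return "prefix"   # can seek on the index over CUST_PO_NUMBER
    if starts_wild and not ends_wild:
        return "suffix"   # can seek on the index over the reversed column
    return "other"        # falls through to the scan branch


masks = ["ttt%", "%ttt", "%ttt%", "%", "ttt"]
for m in masks:
    print(m, "->", classify_mask(m))

# Reversing a suffix mask turns it into a prefix mask, which is why the
# second branch compares the reversed column with REVERSE(CustomerMask).
print(classify_mask("%ttt"[::-1]))
```

Because every mask gets exactly one label, the three branches cannot return the same pair twice, so UNION ALL is safe under these assumptions.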
EDIT: You can implement a sharding system to spread out the customers and customer masks tables across multiple servers and then have each server evaluate 1/n% of the results. You don't need to partition the data -- simple replication of the entire contents of each table will do. Link the servers to your main server and you can do something to the effect of:
SELECT ... FROM OPENQUERY(LinkedServer1, 'SELECT ... LIKE ... WHERE ID BETWEEN 0 AND 99')
UNION ALL
SELECT ... FROM OPENQUERY(LinkedServer2, 'SELECT ... LIKE ... WHERE ID BETWEEN 100 AND 199')
Note: the OPENQUERY may be extraneous, SQL Server might be smart enough to evaluate queries on remote servers and stream the results back. I know it doesn't do that for JET linked servers, but it might handle its own kind better.
That or throw more hardware at the problem.
You can create an Indexed View of your query to improve performance.
From Designing Indexed Views:
For a standard view, the overhead of dynamically building the result set for each query that references a view can be significant for views that involve complex processing of large numbers of rows, such as aggregating lots of data, or joining many rows. If such views are frequently referenced in queries, you can improve performance by creating a unique clustered index on the view. When a unique clustered index is created on a view, the result set is stored in the database just like a table with a clustered index is stored.
Another benefit of creating an index on a view is that the optimizer starts using the view index in queries that do not directly name the view in the FROM clause. Existing queries can benefit from the improved efficiency of retrieving data from the indexed view without having to be recoded.
This should improve the performance of this particular query, but note that inserts, updates and deletes on the tables it uses may be slowed.
You can't use LIKE if you care about performance.
If you are trying to do approximate string matching (e.g. Test and est and best, etc.) and you don't want to use Sql full-text search take a look at this article.
At least you can shortlist approximate matches then run your wildcard test on them.
--EDIT 2--
Your problem is interesting in the context of your limitation. Thinking about it again, I am pretty sure that using 3 gram would boost the performance (going back to my initial suggestion).
Let's say you set up your 3-gram data; you'll then have the following tables:
Customer : 14M
Customer3Grams : Maximum 700M //Considering the field is varchar(50)
3Grams : 78
Pattern : 1000
Pattern3Grams : 50K
To join pattern to customer then you need the following join:
Pattern x Pattern3Grams x Customer3Grams x Customer
With appropriate indexing (which is easy) each look-up can happen in O(LOG(50K)+LOG(700M)+LOG(14M)), which comes to about 47.6 using natural logarithms.
Considering appropriate indexes are present the whole join can be calculated with less than 50,000 look-ups and of course the cost of scanning after look ups. I expect it to be very efficient (matter of seconds).
The cost of creating 3grams for each new customer is also minimal because it would be maximum of 50x75 possible three grams that should be appended to the customer3Grams table.
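For readers who have not met 3-grams before, here is a small Python sketch of what gets stored per customer value, together with a check of the look-up estimate. The trigrams helper is hypothetical, and the example string is taken from the hash-clustering edit below; the row counts are the ones listed in this answer.

```python
import math


def trigrams(text):
    """Return every run of three consecutive characters in text."""
    return [text[i:i + 3] for i in range(len(text) - 2)]


# A short value and the worst case for a varchar(50) column.
print(trigrams("123ttt"))
print(len(trigrams("x" * 50)))

# The look-up estimate, evaluated with natural logarithms.
lookup_cost = math.log(50_000) + math.log(700_000_000) + math.log(14_000_000)
print(round(lookup_cost, 1))
```

A 50-character value produces at most 48 trigrams, which is in line with the rough upper bound of 14M x 50 = 700M rows for Customer3Grams.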
--EDIT--
Depending on your data I can also suggest hash based clustering. I assume customer numbers are numbers with some character patterns in them (e.g. 123231ttt3x4). If this is the case you can create a simple hash function that calculates the result of bit-wise OR for every letter (not numbers) and add it as an indexed column to your table. You can filter on the result of the hash before applying LIKE.
Depending on your data this may cluster your data effectively and improve your search by a factor of the number of clusters (number of hashes). You can test it by applying the hash and counting the number of distinct hashes generated.

Using more than one index per table is dangerous?

In a former company I worked at, the rule of thumb was that a table should have no more than one index (allowing the odd exception, and certain parent-tables holding references to nearly all other tables and thus are updated very frequently).
The idea being that often, indexes cost the same or more to uphold than they gain. Note that this question is different to indexed-view-vs-indexes-on-table as the motivation is not only reporting.
Is this true? Is this index-purism worth it?
In your career do you generally avoid using indexes?
What are the general large-scale recommendations regarding indexes?
Currently and at the last company we use SQL Server, so any product specific guidelines are welcome too.
You need to create exactly as many indexes as you need to create. No more, no less. It is as simple as that.
Everybody "knows" that an index will slow down DML statements on a table. But for some reason very few people actually bother to test just how "slow" it becomes in their context. Sometimes I get the impression that people think that adding another index will add several seconds to each inserted row, making it a game changing business tradeoff that some fictive hotshot user should decide in a board room.
I'd like to share an example that I just created on my 2 year old pc, using a standard MySQL installation. I know you tagged the question SQL Server, but the example should be easily converted. I insert 1,000,000 rows into three tables. One table without indexes, one table with one index and one table with nine indexes.
drop table numbers;
drop table one_million_rows;
drop table one_million_one_index;
drop table one_million_nine_index;
/*
|| Create a dummy table to assist in generating rows
*/
create table numbers(n int);
insert into numbers(n) values(0),(1),(2),(3),(4),(5),(6),(7),(8),(9);
/*
|| Create a table consisting of 1,000,000 consecutive integers
*/
create table one_million_rows as
select d1.n + (d2.n * 10)
+ (d3.n * 100)
+ (d4.n * 1000)
+ (d5.n * 10000)
+ (d6.n * 100000) as n
from numbers d1
,numbers d2
,numbers d3
,numbers d4
,numbers d5
,numbers d6;
/*
|| Create an empty table with 9 integer columns.
|| One column will be indexed
*/
create table one_million_one_index(
c1 int, c2 int, c3 int
,c4 int, c5 int, c6 int
,c7 int, c8 int, c9 int
,index(c1)
);
/*
|| Create an empty table with 9 integer columns.
|| All nine columns will be indexed
*/
create table one_million_nine_index(
c1 int, c2 int, c3 int
,c4 int, c5 int, c6 int
,c7 int, c8 int, c9 int
,index(c1), index(c2), index(c3)
,index(c4), index(c5), index(c6)
,index(c7), index(c8), index(c9)
);
/*
|| Insert 1,000,000 rows in the table with one index
*/
insert into one_million_one_index(c1,c2,c3,c4,c5,c6,c7,c8,c9)
select n, n, n, n, n, n, n, n, n
from one_million_rows;
/*
|| Insert 1,000,000 rows in the table with nine indexes
*/
insert into one_million_nine_index(c1,c2,c3,c4,c5,c6,c7,c8,c9)
select n, n, n, n, n, n, n, n, n
from one_million_rows;
My timings are:
1m rows into table without indexes: 0,45 seconds
1m rows into table with 1 index: 1,5 seconds
1m rows into table with 9 indexes: 6,98 seconds
I'm better with SQL than statistics and math, but I'd like to think that:
Adding 8 indexes to my table, added (6,98-1,5) 5,48 seconds in total. Each index would then have contributed 0,685 seconds (5,48 / 8) for all 1,000,000 rows. That would mean that the added overhead per row per index would have been 0,000000685 seconds. SOMEBODY CALL THE BOARD OF DIRECTORS!
In conclusion, I'd like to say that the above test case doesn't prove a shit. It just shows that tonight, I was able to insert 1,000,000 consecutive integers into a table in a single user environment. Your results will be different.
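If you don't have a MySQL installation at hand, roughly the same experiment can be run with nothing but Python's built-in sqlite3 module. This is only a sketch under stated assumptions: it uses an in-memory SQLite database instead of MySQL, 100,000 rows instead of 1,000,000 so it finishes quickly, and the helper name timed_insert is made up. The absolute numbers mean even less than the ones above; the point is just that measuring is cheap.

```python
import sqlite3
import time

ROWS = 100_000  # smaller than the 1,000,000 rows above so the sketch runs quickly
COLUMNS = [f"c{i}" for i in range(1, 10)]


def timed_insert(indexed_columns):
    """Create a nine-column table, index the given columns, time a bulk insert."""
    conn = sqlite3.connect(":memory:")
    conn.execute(f"CREATE TABLE t ({', '.join(c + ' INT' for c in COLUMNS)})")
    for col in indexed_columns:
        conn.execute(f"CREATE INDEX idx_{col} ON t ({col})")
    rows = ((n,) * 9 for n in range(ROWS))
    start = time.perf_counter()
    with conn:  # commit once at the end of the bulk insert
        conn.executemany("INSERT INTO t VALUES (?,?,?,?,?,?,?,?,?)", rows)
    elapsed = time.perf_counter() - start
    count = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
    conn.close()
    return elapsed, count


results = {
    "no indexes": timed_insert([]),
    "1 index": timed_insert(COLUMNS[:1]),
    "9 indexes": timed_insert(COLUMNS),
}
for label, (elapsed, count) in results.items():
    print(f"{count} rows into table with {label}: {elapsed:.2f} seconds")
```

Divide the difference between two timings by the number of extra indexes and by the number of rows to get your own per-row, per-index overhead.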
That is utterly ridiculous. First, you need multiple indexes in order to perform correctly. For instance, if you have a primary key, you automatically have an index. That means you can't index anything else with the rule you described. So if you don't index foreign keys, joins will be slow, and if you don't index fields used in the where clause, queries will still be slow. Yes, you can have too many indexes, as they do take extra time to insert and update and delete records, but more than one is not dangerous; it is a requirement to have a system that performs well. And I have found that users tolerate a longer time to insert better than they tolerate a longer time to query.
Now the exception might be for a system that takes thousands of readings per second from some automated equipment. This is a database that generally doesn't have indexes to speed inserts. But usually these types of databases are also not used for reading, the data is transferred instead daily to a reporting database which is indexed.
Yes, definitely - too many indexes on a table can be worse than no indexes at all. However, I don't think there's any good in having the "at most one index per table" rule.
For SQL Server, my rule is:
index any foreign key fields - this helps JOINs and is beneficial to other queries, too
index any other fields when it makes sense, e.g. when lots of intensive queries can benefit from it
Finding the right mix of indices - weighing the pros of speeding up queries vs. the cons of additional overhead on INSERT, UPDATE, DELETE - is not an exact science - it's more about know-how, experience, measuring, measuring, and measuring again.
Any fixed rule is bound to be more counterproductive than anything else.....
The best content on indexing comes from Kimberly Tripp - the Queen of Indexing - see her blog posts here.
Unless you like very slow reads, you should have indexes. Don't go overboard, but don't be afraid of being liberal about them either. EVERY FK should be indexed. You're going to do a look up each of these columns on inserts to other tables to make sure the references are set. The index helps. As well as the fact that indexed columns are used often in joins and selects.
We have some tables that are inserted into rarely, with millions of records. Some of these tables are also quite wide. It's not uncommon for these tables to have 15+ indexes. Other tables with heavy inserting and low reads we might only have a handful of indexes- but one index per table is crazy.
Updating an index is once per insert (per index). Speed gain is for every select. So if you update infrequently and read often, then the extra work may be well worth it.
If you do different selects (meaning the columns you filter on are different), then maintaining an index for each type of query is very useful. Provided you have a limited set of columns that you query often.
But the usual advice holds: if you want to know which is fastest: profile!
You should of course be careful not to create too many indexes per table, but only ever using a single index per table is not a useful level.
How many indexes to use depends on how the table is used. A table that is updated often would generally have fewer indexes than one that is read much more often than it's updated.
We have some tables that are updated regularly by a job every two minutes, but they are read often by queries that vary a lot, so they have several indexes. One table, for example, has 24 indexes.
So much depends on your schema and the queries that you normally run. For example: if you normally need to select above 60% of the rows of your table, indexes won't help you and it will be cheaper to table scan than to index scan and then lookup rows. Focused queries that select a small number of rows in different parts of the table or which are used for joins in queries will probably benefit from indexes. The right index in the right place can make or break a feature.
Indexes take space so making too many indexes on a table can be counter productive for the same reasons listed above. Scanning 5 indexes and then performing row lookups may be much more expensive than simply table scanning.
Good design is about knowing when to normalise and when not to.
If you frequently join on a particular column, check the IO plan with the index and without. As a general rule I avoid tables with more than 20 columns. This is often a sign that the data should be normalised. More than about 5 indexes on a table and you may be using more space for the indexes than the main table, be sure that is worth it. These rules are only the lightest of guidance and so much depends on how the data will be used in queries and what your data update profile looks like.
Experiment with your query plans to see how your solution improves or degrades with an index.
Every table must have a PK, which is indexed of course (generally a clustered one), then every FK should be indexed as well.
Finally you may want to index fields that you often sort on, if their data is well differentiated: for a field with only 5 possible values in a table with 1 million records, an index will not be of great benefit.
I tend to be minimalistic with indexes until the db starts being well filled and ... slower. It is easy to identify the bottlenecks and add just the right indexes at that point.
Optimizing the retrieval with indexes must be carefully designed to reflect actual query patterns. Surely, for a table with Primary Key, you will have at least one clustered index (that's how data is actually stored), then any additional indexes are taking advantage of the layout of the data (clustered index).
After analyzing queries that execute against the table, you want to design an index(s) that cover them. That may mean building one or more indexes but that heavily depends on the queries themselves. That decision cannot be made just by looking at column statistics only.
For tables where it's mostly inserts, i.e. ETL tables or something, then you should not create Primary Keys, or actually drop indexes and re-create them if data changes too quickly or drop/recreated entirely.
I personally would be scared to step into an environment that has a hard-coded rule of indexes per table ratio.

Slow distinct query in SQL Server over large dataset

We're using SQL Server 2005 to track a fair amount of constantly incoming data (5-15 updates per second). We noticed after it has been in production for a couple months that one of the tables has started to take an obscene amount of time to query.
The table has 3 columns:
id -- autonumber (clustered)
typeUUID -- GUID generated before the insert happens; used to group the types together
typeName -- The type name (duh...)
One of the queries we run is a distinct on the typeName field:
SELECT DISTINCT [typeName] FROM [types] WITH (nolock);
The typeName field has a non-clustered, non-unique ascending index on it. The table contains approximately 200M records at the moment. When we run this query, the query took 5m 58s to return! Perhaps we're not understanding how the indexes work... But I didn't think we mis-understood them that much.
To test this a little further, we ran the following query:
SELECT DISTINCT [typeName] FROM (SELECT TOP 1000000 [typeName] FROM [types] WITH (nolock)) AS [subtbl]
This query returns in about 10 seconds, as I would expect, it's scanning the table.
Is there something we're missing here? Why does the first query take so long?
Edit: Ah, my apologies, the first query returns 76 records, thank you ninesided.
Follow up: Thank you all for your answers, it makes more sense to me now (I don't know why it didn't before...). Without an index, it's doing a table scan across 200M rows, with an index, it's doing an index scan across 200M rows...
SQL Server does prefer the index, and it does give a little bit of a performance boost, but nothing to be excited about. Rebuilding the index did take the query time down to just over 3m instead of 6m, an improvement, but not enough. I'm just going to recommend to my boss that we normalize the table structure.
Once again, thank you all for your help!!
You do misunderstand the index. Even if it did use the index it would still do an index scan across 200M entries. This is going to take a long time, plus the time it takes to do the DISTINCT (causes a sort) and it's a bad thing to run. Seeing a DISTINCT in a query always raises a red flag and causes me to double check the query. In this case, perhaps you have a normalization issue?
There is an issue with the SQL Server optimizer when using the DISTINCT keyword. The solution was to force it to keep the same query plan by breaking out the distinct query separately.
So we took queries such as:
SELECT DISTINCT [typeName] FROM [types] WITH (nolock);
and break it up into the following:
SELECT typeName INTO #tempTable1 FROM types WITH (NOLOCK)
SELECT DISTINCT typeName FROM #tempTable1
Another way to get around it is to use a GROUP BY, which gets a different optimization plan.
I doubt SQL Server will even try to use the index, it'd have to do practically the same amount of work (given the narrow table), reading all 200M rows regardless of whether it looks at the table or the index. If the index on typeName was clustered it may reduce the time taken as it shouldn't need to sort before grouping.
If the cardinality of your types is low, how about maintaining a summary table which holds the list of distinct type values? A trigger on insert/update of the main table would do a check on the summary table and insert a new record when a new type is found.
As others have already pointed out - when you do a SELECT DISTINCT (typename) over your table, you'll end up with a full table scan no matter what.
So it's really a matter of limiting the number of rows that need to be scanned.
The question is: what do you need your DISTINCT typenames for? And how many of your 200M rows are distinct? Do you have only a handful (a few hundred at most) distinct typenames??
If so - you could have a separate table DISTINCT_TYPENAMES or something and fill those initially by doing a full table scan, and then on inserting new rows to the main table, just always check whether their typename is already in DISTINCT_TYPENAMES, and if not, add it.
That way, you'd have a separate, small table with just the distinct TypeName entries, which would be lightning fast to query and/or to display.
Marc
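The summary-table idea can be tried out quickly with Python's built-in sqlite3 module. This is a sketch, not SQL Server code: the trigger syntax is SQLite's, the names distinct_typenames and trg_types_insert are made up, and it only handles inserts (an update that changes typeName would need its own trigger). On SQL Server the trigger would read from the inserted pseudo-table instead of NEW.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE types (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    typeUUID TEXT,
    typeName TEXT
);
CREATE TABLE distinct_typenames (typeName TEXT PRIMARY KEY);
-- On every insert, record the type name unless it is already known.
CREATE TRIGGER trg_types_insert AFTER INSERT ON types
BEGIN
    INSERT OR IGNORE INTO distinct_typenames (typeName) VALUES (NEW.typeName);
END;
""")

conn.executemany(
    "INSERT INTO types (typeUUID, typeName) VALUES (?, ?)",
    [("u1", "alpha"), ("u2", "beta"), ("u3", "alpha"), ("u4", "gamma"), ("u5", "beta")],
)

# The small table answers the DISTINCT question without touching the big one.
names = [
    row[0]
    for row in conn.execute("SELECT typeName FROM distinct_typenames ORDER BY typeName")
]
print(names)
```

The cost per insert is a single primary-key lookup on a table with a handful of rows, which should be cheap next to scanning 200M rows on every read.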
A looping approach should use multiple seeks (but loses some parallelism). It might be worth a try for cases with relatively few distinct values compared to the total number of rows (low cardinality).
Idea was from this question:
select typeName into #Result from Types where 1=0;
declare @t varchar(100) = (select min(typeName) from Types);
while @t is not null
begin
-- store the current value, then seek to the next larger one
insert into #Result values (@t);
set @t = (select top 1 typeName from Types where typeName > @t order by typeName);
end
select * from #Result;
And it looks like there are also some other methods (notably the recursive CTE by @Paul White):
different-ways-to-find-distinct-values-faster-methods
sqlservercentral Topic873124-338-5
My first thought is statistics. To find last updated:
SELECT
name AS index_name,
STATS_DATE(object_id, index_id) AS statistics_update_date
FROM
sys.indexes
WHERE
object_id = OBJECT_ID('MyTable');
Edit: Stats are updated when indexes are rebuilt, which I see are not maintained
My second thought is: is the index still there? The TOP query should still use an index.
I've just tested on one of my tables with 57 million rows and both use the index.
An indexed view can make this faster.
create view alltypes
with schemabinding as
select typename, count_big(*) as kount
from dbo.types
group by typename
create unique clustered index idx
on alltypes (typename)
The work to keep the view up to date on each change to the base table should be moderate (depending on your application, of course -- my point is that it doesn't have to scan the whole table each time or do anything insanely expensive like that.)
Alternatively you could make a small table holding all values:
select distinct typename
into alltypes
from types
alter table alltypes
add primary key (typename)
alter table types add foreign key (typename) references alltypes
The foreign key will make sure that all values used appear in the parent alltypes table. The trouble is in ensuring that alltypes does not contain values not used in the child types table.
I would try something like this:
SELECT typeName FROM [types] WITH (nolock)
group by typeName;
And like the others, I would say you need to normalize that column.
An index helps you quickly find a row. But you're asking the database to list all unique types for the entire table. An index can't help with that.
You could run a nightly job which runs the query and stores it in a different table. If you require up-to-date data, you could store the last ID included in the nightly scan, and combine the results:
select type
from nightlyscan
union
select distinct type
from verybigtable
where rowid > lastscannedid
Another option is to normalize the big table into two tables:
table1: id, guid, typeid
type table: typeid, typename
This would be very beneficial if the number of types was relatively small.
I could be missing something, but would it be more efficient, at the cost of some overhead on load, to create a view with the distinct values and query that instead?
This would give almost instant responses to the select if the result set is significantly smaller, with the overhead of populating it on each write, though given the nature of the view that might be trivial in itself.
It does raise the question of how many writes there are compared to how often you want the distinct values, and how important speed is when you do.

What is a Covered Index?

I've just heard the term covered index in some database discussion - what does it mean?
A covering index is an index that contains all of, and possibly more, the columns you need for your query.
For instance, this:
SELECT *
FROM tablename
WHERE criteria
will typically use indexes to speed up the resolution of which rows to retrieve using criteria, but then it will go to the full table to retrieve the rows.
However, if the index contained the columns column1, column2 and column3, then this sql:
SELECT column1, column2
FROM tablename
WHERE criteria
and, provided that particular index could be used to speed up the resolution of which rows to retrieve, the index already contains the values of the columns you're interested in, so it won't have to go to the table to retrieve the rows, but can produce the results directly from the index.
This can also be used if you see that a typical query uses 1-2 columns to resolve which rows, and then typically adds another 1-2 columns, it could be beneficial to append those extra columns (if they're the same all over) to the index, so that the query processor can get everything from the index itself.
Here's an article: Index Covering Boosts SQL Server Query Performance on the subject.
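You can watch the difference between the two queries in a query plan. The sketch below uses Python's built-in sqlite3 module as a stand-in for SQL Server (an assumption made only so it runs anywhere; SQL Server shows the same idea as an index seek with or without a key lookup). The extra column column4 and the index name idx_cover are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tablename (column1 INT, column2 INT, column3 INT, column4 TEXT)"
)
# The index contains column1, column2 and column3, but not column4.
conn.execute("CREATE INDEX idx_cover ON tablename (column1, column2, column3)")


def plan(sql):
    """Return the 'detail' column of SQLite's EXPLAIN QUERY PLAN as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[3] for row in rows)


# Every requested column is in the index: answered from the index alone.
covered = plan("SELECT column1, column2 FROM tablename WHERE column1 = 1")

# SELECT * also needs column4, so each match requires a visit to the table.
not_covered = plan("SELECT * FROM tablename WHERE column1 = 1")

print(covered)
print(not_covered)
```

SQLite labels the first plan with USING COVERING INDEX and the second with plain USING INDEX, which is exactly the distinction described above.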
Covering index is just an ordinary index. It's called "covering" if it can satisfy query without necessity to analyze data.
example:
CREATE TABLE MyTable
(
ID INT IDENTITY PRIMARY KEY,
Foo INT
)
CREATE NONCLUSTERED INDEX index1 ON MyTable(ID, Foo)
SELECT ID, Foo FROM MyTable -- All requested data are covered by index
This is one of the fastest methods to retrieve data from SQL server.
Covering indexes are indexes which "cover" all columns needed from a specific table, removing the need to access the physical table at all for a given query/ operation.
Since the index contains the desired columns (or a superset of them), table access can be replaced with an index lookup or scan -- which is generally much faster.
Columns to cover:
parameterized or static conditions; columns restricted by a parameterized or constant condition.
join columns; columns dynamically used for joining
selected columns; to answer selected values.
While covering indexes can often provide good benefit for retrieval, they do add somewhat to insert/ update overhead; due to the need to write extra or larger index rows on every update.
Covering indexes for Joined Queries
Covering indexes are probably most valuable as a performance technique for joined queries. This is because joined queries are more costly and more likely than single-table retrievals to suffer high-cost performance problems.
in a joined query, covering indexes should be considered per-table.
each 'covering index' removes a physical table access from the plan & replaces it with index-only access.
investigate the plan costs & experiment with which tables are most worthwhile to replace by a covering index.
by this means, the multiplicative cost of large join plans can be significantly reduced.
For example:
select poi.title, c.name, c.address
from porderitem poi
join porder po on po.id = poi.fk_order
join customer c on c.id = po.fk_customer
where po.orderdate > ? and po.status = 'SHIPPING';
create index porder_custitem on porder (orderdate, id, status, fk_customer);
See:
http://literatejava.com/sql/covering-indexes-query-optimization/
Lets say you have a simple table with the below columns, you have only indexed Id here:
Id (Int), Telephone_Number (Int), Name (VARCHAR), Address (VARCHAR)
Imagine you have to run the query below and check whether it uses an index, and whether it performs efficiently without extra I/O calls. Remember, you have only created an index on Id.
SELECT Id FROM mytable WHERE Telephone_Number = '55442233';
When you check the performance of this query you will be disappointed: since Telephone_Number is not indexed, the rows have to be fetched from the table using I/O calls. So this is not a covering index, since a column in the query is not indexed, which leads to frequent I/O calls.
To make it a covered index you need to create a composite index containing both columns. An index on (Id, Telephone_Number) covers the query, but listing Telephone_Number first, as in (Telephone_Number, Id), also lets the database seek on the WHERE condition instead of scanning the whole index.
For more details, please refer to this blog:
https://www.percona.com/blog/2006/11/23/covering-index-and-prefix-indexes/