Poor performance of SQL query with Table Variable or User Defined Type - sql

I have a SELECT query on a view, that contains 500.000+ rows. Let's keep it simple:
SELECT * FROM dbo.Document WHERE MemberID = 578310
The query runs fast, ~0s
Let's rewrite it to work with the set of values, which reflects my needs more:
SELECT * FROM dbo.Document WHERE MemberID IN (578310)
This is same fast, ~0s
But now, the set is of IDs needs to be variable; let's define it as:
DECLARE #AuthorizedMembers TABLE
(
MemberID BIGINT NOT NULL PRIMARY KEY, --primary key
UNIQUE NONCLUSTERED (MemberID) -- and index, as if it could help...
);
INSERT INTO #AuthorizedMembers SELECT 578310
The set contains the same, one value but is a table variable now. The performance of such query drops to 2s, and in more complicated ones go as high as 25s and more, while with a fixed id it stays around ~0s.
SELECT *
FROM dbo.Document
WHERE MemberID IN (SELECT MemberID FROM #AuthorizedMembers)
is the same bad as:
SELECT *
FROM dbo.Document
WHERE EXISTS (SELECT MemberID
FROM #AuthorizedMembers
WHERE [#AuthorizedMembers].MemberID = Document.MemberID)
or as bad as this:
SELECT *
FROM dbo.Document
INNER JOIN #AuthorizedMembers AS AM ON AM.MemberID = Document.MemberID
The performance is same for all the above and always much worse than the one with a fixed value.
The dynamic SQL comes with help easily, so creating an nvarchar like (id1,id2,id3) and building a fixed query with it keeps my query times ~0s. But I would like to avoid using Dynamic SQL as much as possible and if I do, I would like to keep it always the same string, regardless the values (using parameters - which above method does not allow).
Any ideas how to get the performance of the table variable similar to a fixed array of values or avoid building a different dynamic SQL code for each run?
P.S. I have tried the above with a user defined type with same results
Edit:
The results with a temporary table, defined as:
CREATE TABLE #AuthorizedMembers
(
MemberID BIGINT NOT NULL PRIMARY KEY
);
INSERT INTO #AuthorizedMembers SELECT 578310
have improved the execution time up to 3 times. (13s -> 4s). Which is still significantly higher than dynamic SQL <1s.

Your options:
Use a temporary table instead of a TABLE variable
If you insist on using a TABLE variable, add OPTION(RECOMPILE) at the end of your query
Explanation:
When the compiler compiles your statement, the TABLE variable has no rows in it and therefore doesn't have the proper cardinalities. This results in an inefficient execution plan. OPTION(RECOMPILE) forces the statement to be recompiled when it is run. At that point the TABLE variable has rows in it and the compiler has better cardinalities to produce an execution plan.
The general rule of thumb is to use temporary tables when operating on large datasets and table variables for small datasets with frequent updates. Personally I only very rarely use TABLE variables because they generally perform poorly.
I can recommend this answer on the question "What's the difference between temporary tables and table variables in SQL Server?" if you want an in-depth analysis on the differences.

Related

Is it better to use parameter or column value when copying data from one table to another?

I have a SQL statement to copy records from one table to another:
INSERT INTO [deletedItems] (
[id],
[shopId])
SELECT
[id],
[shopId]
FROM [items]
WHERE shopId = #ShopId
#ShopId is a parameter provided to the sql command when calling the db from my application code.
Will it make the statement perform better if I change it to use the provided parameter directly, so the SQL server does not have to include shopId column from products table in the projection?
INSERT INTO [deletedItems](
[id],
[shopId])
SELECT
[id],
#ShopId
FROM [items]
WHERE shopId = #ShopId
Intuition is telling me yes, but at the same time, I would expect the sql server to optimize the execution plan of the first query and ommit the projection of the shopId column anyways (because the value will be the same for all the records) and use a constant value instead.
I would expect the sql server to optimize the execution plan of the
first query and ommit the projection of the shopId column anyways
(because the value will be the same for all the records) and use a
constant value instead.
No, SQL Server, does not do this. You can verify this by looking at the execution plan and the "output columns" for the operator accessing items.
In the general case this is not a safe transformation and can lead to lost information. For example if the source matches the rows
+--------+
| ShopId |
+--------+
| A123 |
| a123 |
+--------+
Then on a case insensitive collation both would match the same predicate and should be inserted but are different.
If one of the following applies
You are using a datatype where this is not possible
You know that this is not an issue in your data - e.g. as check constraints ensure all data is stored trimmed and upper case.
Are happy for a canonical representation to be used for all rows if it is an issue.
then it is possible to come up with convoluted scenarios where your manual optimisation makes sense as below
CREATE TABLE #T(X INT IDENTITY, Y CHAR(4000));
INSERT INTO #T
SELECT TOP 1000000 REPLICATE('A',4000)
FROM sys.all_objects o1, sys.all_objects o2
SELECT X, Y
FROM #T
WHERE Y = REPLICATE('A',4000)
ORDER BY X
SELECT X, REPLICATE('A',4000) AS Y
FROM #T
WHERE Y = REPLICATE('A',4000)
ORDER BY X
The size of the rows going into the sort operator is much bigger in the first case as it includes the large string column and the sort spills to tempdb. The query execution takes substantially longer as a result. The memory grant request for the second query is the same as that of the first as it does not take into account that the column is computed after the sort but there is less data to sort and it does not spill. On versions of SQL Server where adaptive memory grant feedback is available the excessive grant would be corrected if the query is executed repeatedly.
In most real world scenarios I doubt the manual optimisation will make any practical difference however so you should choose whichever one does what you need and you feel is clearer and concentrate optimisation efforts in more promising areas (for me the second one makes clearer that the same value will be inserted in all rows).
I don't except any performances differences. The slow part will be finding the correct items by #ShopID or the IO operations.
What can improve your query performance is having an index on [ShopID] column where ID is primary key or included column.
Will it makes the statement perform better if I change it to use the
provided parameter directly
It's the same. Because your result just has a unique ShopId as Where clause.
INSERT INTO [deletedItems] (
[id],
[shopId])
SELECT
[id],
[shopId]
FROM [items]
WHERE shopId = #ShopId -- this condition makes the `shopId` value is become unique
Two important points.
The calculation of scalar expressions in the SELECT (generally) has little impact on query performance. The performance is determined by data movement.
So, selecting a "constant" versus selecting a column from a table is immaterial.
Second, if you care about performance, you need to be very careful about query plans. Either force the use of an index or be sure that the query gets recompiled periodically as the data changes in your tables.
In particular, you want to be sure that the query uses an index on items(shopId) if the table spans multiple data pages.

Join to SELECT vs. Join to Tableset

For the DB gurus out there, I was wondering if there is any functional/performance difference between Joining to the results a SELECT statement and Joining to a previously filled table variable. I'm working in SQL Server 2008 R2.
Example (TSQL):
-- Create a test table
DROP TABLE [dbo].[TestTable]
CREATE TABLE [dbo].[TestTable](
[id] [int] NOT NULL,
[value] [varchar](max) NULL
) ON [PRIMARY]
-- Populate the test table with a few rows
INSERT INTO [dbo].[TestTable]
SELECT 1123, 'test1'
INSERT INTO [dbo].[TestTable]
SELECT 2234, 'test2'
INSERT INTO [dbo].[TestTable]
SELECT 3345, 'test3'
-- Create a reference table
DROP TABLE [dbo].[TestRefTable]
CREATE TABLE [dbo].[TestRefTable](
[id] [int] NOT NULL,
[refvalue] [varchar](max) NULL
) ON [PRIMARY]
-- Populate the reference table with a few rows
INSERT INTO [dbo].[TestRefTable]
SELECT 1123, 'ref1'
INSERT INTO [dbo].[TestRefTable]
SELECT 2234, 'ref2'
-- Scenario 1: Insert matching results into it's own table variable, then Join
-- Create a table variable
DECLARE #subset TABLE ([id] INT NOT NULL, [refvalue] VARCHAR(MAX))
INSERT INTO #subset
SELECT * FROM [dbo].[TestRefTable]
WHERE [dbo].[TestRefTable].[id] = 1123
SELECT t.*, s.*
FROM [dbo].[TestTable] t
JOIN #subset s
ON t.id = s.id
-- Scenario 2: Join directly to SELECT results
SELECT t.*, s.*
FROM [dbo].TestTable t
JOIN (SELECT * FROM [dbo].[TestRefTable] WHERE id = 1123) s
ON t.id = s.id
In the "real" world, the tables and table variable are pre-defined. What I'm looking at is being able to have the matched reference rows available for further operations, but I'm concerned that the extra steps will slow the query down. Are there technical reasons as to why one would be faster than the other? What sort of performance difference may be seen between the two approaches? I realize it is difficult (if not impossible) to give a definitive answer, just looking for some advice for this scenario.
The database engine has an optimizer to figure out the best way to execute a query. There is more under the hood than you probably imagine. For instance, when SQL Server is doing a join, it has a choice of at least four join algorithms:
Nested Loop
Index Lookup
Merge Join
Hash Join
(not to mention the multi-threaded versions of these.)
It is not important that you understand how each of these works. You just need to understand two things: different algorithms are best under different circumstances and SQL Server does its best to choose the best algorithm.
The choice of join algorithm is only one thing the optimizer does. It also has to figure out the ordering of the joins, the best way to aggregate results, whether a sort is needed for an order by, how to access the data (via indexes or directly), and much more.
When you break the query apart, you are making an assumption about optimization. In your case, you are making the assumption that the first best thing is to do a select on a particular table. You might be right. If so, your result with multiple queries should be about as fast as using a single query. Well, maybe not. When in a single query, SQL Server does not have to buffer all the results at once; it can stream results from one place to another. It may also be able to take advantage of parallelism in a way that splitting the query prevents.
In general, the SQL Server optimizer is pretty good, so you are best letting the optimizer do the query all in one go. There are definitely exceptions, where the optimizer may not choose the best execution path. Sometimes fixing this is as easy as being sure that statistics are up-to-date on tables. Other times, you can add optimizer hints. And other times you can restructure the query, as you have done.
For instance, one place where loading data into a local table is useful is when the table comes from a different server. The optimizer may not have full information about the size of the table to make the best decisions.
In other words, keep the query as one statement. If you need to improve it, then focus on optimization after it works. You generally won't have to spend much time on optimization, because the engine is pretty good at it.
This would give the same result?
SELECT t.*, s.*
FROM dbo.TestTable AS t
JOIN dbo.TestRefTable AS s ON t.id = s.id AND s.id = 1123
Basically, this is a cross join of all records from TestTable and TestRefTable with id = 1123.
Joining to table variables will also result in bad cardinality estimates by the optimizer. Table variables are always assumed by the optimizer to contain only a single row. The more rows it actually has the worse that estimate becomes. This causes the optimizer to assume the wrong number of rows for the table itself, but in other places, for operators that might then join to that result, it can result in wrong estimations of the number executions for that operation.
Personally I think Table parameters should be used for getting data into and out of the server conveniently using client apps (C# .Net apps make good use of them), or for passing data between Stored Procs, but should not be used too much within the proc itself. The importance of getting rid of them within the Proc code itself increases with the expected number of rows to be carried by the parameter.
Sub Selects will perform better, or immediately copying into a temp table will work well. There is overhead for copying into the temp table, but again, the more rows you have the more worth it that overhead becomes because the estimates by the optimizer get worse and worse.
In general a derived table in the query is probably going to be faster than joining to a table variable because it can make use of indexes and they are not available in table variables. However, temp tables can also have indexes creted and that might solve the potential performance difference.
Also if the number of table variable records is expected to be small, then indexes won't make a great deal of difference anyway and so there would be little or no differnce.
As alawys you need to test on your own system as number of records and table design and index design havea great deal to do with what works best.
I'd expect the direct Table join to be faster than the Table to TableVariable, and use less resources.

SQL "WITH" Performance and Temp Table (possible "Query Hint" to simplify)

Given the example queries below (Simplified examples only)
DECLARE #DT int; SET #DT=20110717; -- yes this is an INT
WITH LargeData AS (
SELECT * -- This is a MASSIVE table indexed on dt field
FROM mydata
WHERE dt=#DT
), Ordered AS (
SELECT TOP 10 *
, ROW_NUMBER() OVER (ORDER BY valuefield DESC) AS Rank_Number
FROM LargeData
)
SELECT * FROM Ordered
and ...
DECLARE #DT int; SET #DT=20110717;
BEGIN TRY DROP TABLE #LargeData END TRY BEGIN CATCH END CATCH; -- dump any possible table.
SELECT * -- This is a MASSIVE table indexed on dt field
INTO #LargeData -- put smaller results into temp
FROM mydata
WHERE dt=#DT;
WITH Ordered AS (
SELECT TOP 10 *
, ROW_NUMBER() OVER (ORDER BY valuefield DESC) AS Rank_Number
FROM #LargeData
)
SELECT * FROM Ordered
Both produce the same results, which is a limited and ranked list of values from a list based on a fields data.
When these queries get considerably more complicated (many more tables, lots of criteria, multiple levels of "with" table alaises, etc...) the bottom query executes MUCH faster then the top one. Sometimes in the order of 20x-100x faster.
The Question is...
Is there some kind of query HINT or other SQL option that would tell the SQL Server to perform the same kind of optimization automatically, or other formats of this that would involve a cleaner aproach (trying to keep the format as much like query 1 as possible) ?
Note that the "Ranking" or secondary queries is just fluff for this example, the actual operations performed really don't matter too much.
This is sort of what I was hoping for (or similar but the idea is clear I hope). Remember this query below does not actually work.
DECLARE #DT int; SET #DT=20110717;
WITH LargeData AS (
SELECT * -- This is a MASSIVE table indexed on dt field
FROM mydata
WHERE dt=#DT
**OPTION (USE_TEMP_OR_HARDENED_OR_SOMETHING) -- EXAMPLE ONLY**
), Ordered AS (
SELECT TOP 10 *
, ROW_NUMBER() OVER (ORDER BY valuefield DESC) AS Rank_Number
FROM LargeData
)
SELECT * FROM Ordered
EDIT: Important follow up information!
If in your sub query you add
TOP 999999999 -- improves speed dramatically
Your query will behave in a similar fashion to using a temp table in a previous query. I found the execution times improved in almost the exact same fashion. WHICH IS FAR SIMPLIER then using a temp table and is basically what I was looking for.
However
TOP 100 PERCENT -- does NOT improve speed
Does NOT perform in the same fashion (you must use the static Number style TOP 999999999 )
Explanation:
From what I can tell from the actual execution plan of the query in both formats (original one with normal CTE's and one with each sub query having TOP 99999999)
The normal query joins everything together as if all the tables are in one massive query, which is what is expected. The filtering criteria is applied almost at the join points in the plan, which means many more rows are being evaluated and joined together all at once.
In the version with TOP 999999999, the actual execution plan clearly separates the sub querys from the main query in order to apply the TOP statements action, thus forcing creation of an in memory "Bitmap" of the sub query that is then joined to the main query. This appears to actually do exactly what I wanted, and in fact it may even be more efficient since servers with large ammounts of RAM will be able to do the query execution entirely in MEMORY without any disk IO. In my case we have 280 GB of RAM so well more then could ever really be used.
Not only can you use indexes on temp tables but they allow the use of statistics and the use of hints. I can find no refernce to being able to use the statistics in the documentation on CTEs and it says specifically you cann't use hints.
Temp tables are often the most performant way to go when you have a large data set when the choice is between temp tables and table variables even when you don't use indexes (possobly because it will use statistics to develop the plan) and I might suspect the implementation of the CTE is more like the table varaible than the temp table.
I think the best thing to do though is see how the excutionplans are different to determine if it is something that can be fixed.
What exactly is your objection to using the temp table when you know it performs better?
The problem is that in the first query SQL Server query optimizer is able to generate a query plan. In the second query a good query plan isn't able to be generated because you're inserting the values into a new temporary table. My guess is there is a full table scan going on somewhere that you're not seeing.
What you may want to do in the second query is insert the values into the #LargeData temporary table like you already do and then create a non-clustered index on the "valuefield" column. This might help to improve your performance.
It is quite possible that SQL is optimizing for the wrong value of the parameters.
There are a couple of options
Try using option(RECOMPILE). There is a cost to this as it recompiles the query every time but if different plans are needed it might be worth it.
You could also try using OPTION(OPTIMIZE FOR #DT=SomeRepresentatvieValue) The problem with this is you pick the wrong value.
See I Smell a Parameter! from The SQL Server Query Optimization Team blog

Unexpected #temp table performance

Bounty open:
Ok people, the boss needs an answer and I need a pay rise. It doesn't seem to be a cold caching issue.
UPDATE:
I've followed the advice below to no avail. How ever the client statistics threw up an interesting set of number.
#temp vs #temp
Number of INSERT, DELETE and UPDATE statements
0 vs 1
Rows affected by INSERT, DELETE, or UPDATE statements
0 vs 7647
Number of SELECT statements
0 vs 0
Rows returned by SELECT statements
0 vs 0
Number of transactions
0 vs 1
The most interesting being the number of rows affected and the number of transactions. To remind you, the queries below return identical results set, just into different styles of tables.
The following query are basicaly doing the same thing. They both select a set of results (about 7000) and populate this into either a temp or var table. In my mind the var table #temp should be created and populated quicker than the temp table #temp however the var table in the first example takes 1min 15sec to execute where as the temp table in the second example takes 16 seconds.
Can anyone offer an explanation?
declare #temp table (
id uniqueidentifier,
brand nvarchar(255),
field nvarchar(255),
date datetime,
lang nvarchar(5),
dtype varchar(50)
)
insert into #temp (id, brand, field, date, lang, dtype )
select id, brand, field, date, lang, dtype
from view
where brand = 'myBrand'
-- takes 1:15
vs
select id, brand, field, date, lang, dtype
into #temp
from view
where brand = 'myBrand'
DROP TABLE #temp
-- takes 16 seconds
I believe this almost completely comes down to table variable vs. temp table performance.
Table variables are optimized for having exactly one row. When the query optimizer chooses an execution plan, it does it on the (often false) assumption that that the table variable only has a single row.
I can't find a good source for this, but it is at least mentioned here:
http://technet.microsoft.com/en-us/magazine/2007.11.sqlquery.aspx
Other related sources:
http://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=125052
http://databases.aspfaq.com/database/should-i-use-a-temp-table-or-a-table-variable.html
Run both with SET STATISTICS IO ON and SET STATISTICS TIME ON. Run 6-7 times each, discard the best and worst results for both cases, then compare the two average times.
I suspect the difference is primarily from a cold cache (first execution) vs. a warm cache (second execution). The output from STATISTICS IO would give away such a case, as a big difference in the physical reads between the runs.
And make sure you have 'lab' conditions for the test: no other tasks running (no lock contention), databases (including tempdb) and logs are pre-grown to required size so you don't hit any log growth or database growth event.
This is not uncommon. Table variables can be (and in a lot of cases ARE) slower than temp tables. Here are some of the reasons for this:
SQL Server maintains statistics for queries that use temporary tables but not for queries that use table variables. Without statistics, SQL Server might choose a poor processing plan for a query that contains a table variable
Non-clustered indexes cannot be created on table variables, other than the system indexes that are created for a PRIMARY or UNIQUE constraint. That can influence the query performance when compared to a temporary table with non-clustered indexes.
table variables use internal metadata in a way that prevents the engine from using a table variable within a parallel query (this means that it wont take advantage of multi-processor machines).
A table variable is optimized for one row, by SQL Server (it assumes 1 row will be returned).
I'm not 100% that this is the cause, but the table var will not have any statistics whereas the temp table will.
SELECT INTO is a non-logged operation, which would likely explain most of the performance difference. INSERT creates a log entry for every operation.
Additionally, SELECT INTO is creating the table as part of the operation, so SQL Server knows automatically that there are no constraints on it, which may factor in.
If it takes over a full minute to insert 7000 records into a temp table (persistent or variable), then the perf issue is almost certainly in the SELECT statement that's populating it.
Have you run DBCC FREEPROCCACHE and DBCC DROPCLEANBUFFERS before profiling? I'm thinking that maybe it's using some cached results for the second query.

Slow distinct query in SQL Server over large dataset

We're using SQL Server 2005 to track a fair amount of constantly incoming data (5-15 updates per second). We noticed after it has been in production for a couple months that one of the tables has started to take an obscene amount of time to query.
The table has 3 columns:
id -- autonumber (clustered)
typeUUID -- GUID generated before the insert happens; used to group the types together
typeName -- The type name (duh...)
One of the queries we run is a distinct on the typeName field:
SELECT DISTINCT [typeName] FROM [types] WITH (nolock);
The typeName field has a non-clusted, non-unique ascending index on it. The table contains approximately 200M records at the moment. When we run this query, the query took 5m 58s to return! Perhaps we're not understanding how the indexes work... But I didn't think we mis-understood them that much.
To test this a little further, we ran the following query:
SELECT DISTINCT [typeName] FROM (SELECT TOP 1000000 [typeName] FROM [types] WITH (nolock)) AS [subtbl]
This query returns in about 10 seconds, as I would expect, it's scanning the table.
Is there something we're missing here? Why does the first query take so long?
Edit: Ah, my apologies, the first query returns 76 records, thank you ninesided.
Follow up: Thank you all for your answers, it makes more sense to me now (I don't know why it didn't before...). Without an index, it's doing a table scan across 200M rows, with an index, it's doing an index scan across 200M rows...
SQL Server does prefer the index, and it does give a little bit of a performance boost, but nothing to be excited about. Rebuilding the index did take the query time down to just over 3m instead of 6m, an improvement, but not enough. I'm just going to recommend to my boss that we normalize the table structure.
Once again, thank you all for your help!!
You do misunderstand the index. Even if it did use the index it would still do an index scan across 200M entries. This is going to take a long time, plus the time it takes to do the DISTINCT (causes a sort) and it's a bad thing to run. Seeing a DISTINCT in a query always raises a red flag and causes me to double check the query. In this case, perhaps you have a normalization issue?
There is an issue with the SQL Server optimizer when using the DISTINCT keyword. The solution was to force it to keep the same query plan by breaking out the distinct query separately.
So we took queries such as:
SELECT DISTINCT [typeName] FROM [types] WITH (nolock);
and break it up into the following:
SELECT typeName INTO #tempTable1 FROM types WITH (NOLOCK)
SELECT DISTINCT typeName FROM #tempTable1
Another way to get around it is to use a GROUP BY, which gets a different optimization plan.
I doubt SQL Server will even try to use the index, it'd have to do practically the same amount of work (given the narrow table), reading all 200M rows regardless of whether it looks at the table or the index. If the index on typeName was clustered it may reduce the time taken as it shouldn't need to sort before grouping.
If the cardinality of your types is low, how about maintaining a summary table which holds the list of distinct type values? A trigger on insert/update of the main table would do a check on the summary table and insert a new record when a new type is found.
As others have already pointed out - when you do a SELECT DISTINCT (typename) over your table, you'll end up with a full table scan no matter what.
So it's really a matter of limiting the number of rows that need to be scanned.
The question is: what do you need your DISTINCT typenames for? And how many of your 200M rows are distinct? Do you have only a handful (a few hundred at most) distinct typenames??
If so - you could have a separate table DISTINCT_TYPENAMES or something and fill those initially by doing a full table scan, and then on inserting new rows to the main table, just always check whether their typename is already in DISTINCT_TYPENAMES, and if not, add it.
That way, you'd have a separate, small table with just the distinct TypeName entries, which would be lightning fast to query and/or to display.
Marc
A looping approach should use multiple seeks (but loses some parallelism). It might be worth a try for cases with relatively few distinct values compared to the total number of rows (low cardinality).
Idea was from this question:
select typeName into #Result from Types where 1=0;
declare #t varchar(100) = (select min(typeName) from Types);
while #t is not null
begin
set #t = (select top 1 typeName from Types where typeName > #t order by typeName);
if (#t is not null)
insert into #Result values (#t);
end
select * from #Result;
And looks like there are also some other methods (notably the recursive CTE #Paul White):
different-ways-to-find-distinct-values-faster-methods
sqlservercentral Topic873124-338-5
My first thought is statistics. To find last updated:
SELECT
name AS index_name,
STATS_DATE(object_id, index_id) AS statistics_update_date
FROM
sys.indexes
WHERE
object_id = OBJECT_ID('MyTable');
Edit: Stats are updated when indexes are rebuilt, which I see are not maintained
My second thought is that is the index still there? The TOP query should still use an index.
I've just tested on one of my tables with 57 million rows and both use the index.
An indexed view can make this faster.
create view alltypes
with schemabinding as
select typename, count_big(*) as kount
from dbo.types
group by typename
create unique clustered index idx
on alltypes (typename)
The work to keep the view up to date on each change to the base table should be moderate (depending on your application, of course -- my point is that it doesn't have to scan the whole table each time or do anything insanely expensive like that.)
Alternatively you could make a small table holding all values:
select distinct typename
into alltypes
from types
alter table alltypes
add primary key (typename)
alter table types add foreign key (typename) references alltypes
The foreign key will make sure that all values used appear in the parent alltypes table. The trouble is in ensuring that alltypes does not contain values not used in the child types table.
I should try something like this:
SELECT typeName FROM [types] WITH (nolock)
group by typeName;
And like other i would say you need to normalize that column.
An index helps you quickly find a row. But you're asking the database to list all unique types for the entire table. An index can't help with that.
You could run a nightly job which runs the query and stores it in a different table. If you require up-to-date data, you could store the last ID included in the nightly scan, and combine the results:
select type
from nightlyscan
union
select distinct type
from verybigtable
where rowid > lastscannedid
Another option is to normalize the big table into two tables:
talbe1: id, guid, typeid
type table: typeid, typename
This would be very beneficial if the number of types was relatively small.
I could be missing something but would it be more efficient if an overhead on load to create a view with distinct values and query that instead?
This would give almost instant responses to the select if the result set is significantly smaller with the overhead over populating it on each write though given the nature of the view that might be trivial in itself.
It does ask the question how many writes compared to how often you want the distinct and the importance of the speed when you do.