Is it better to use parameter or column value when copying data from one table to another? - sql

I have a SQL statement to copy records from one table to another:
INSERT INTO [deletedItems] (
[id],
[shopId])
SELECT
[id],
[shopId]
FROM [items]
WHERE shopId = @ShopId
@ShopId is a parameter provided to the SQL command when calling the database from my application code.
Will it make the statement perform better if I change it to use the provided parameter directly, so that SQL Server does not have to include the shopId column from the source table in the projection?
INSERT INTO [deletedItems](
[id],
[shopId])
SELECT
[id],
@ShopId
FROM [items]
WHERE shopId = @ShopId
Intuition is telling me yes, but at the same time I would expect SQL Server to optimize the execution plan of the first query and omit the projection of the shopId column anyway (because the value will be the same for all the records) and use a constant value instead.

I would expect SQL Server to optimize the execution plan of the first query and omit the projection of the shopId column anyway (because the value will be the same for all the records) and use a constant value instead.
No, SQL Server does not do this. You can verify this by looking at the execution plan and the "output columns" of the operator accessing items.
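A minimal way to check it yourself (a sketch; the declared type and sample value for @ShopId are assumptions, since the question does not state them):
DECLARE @ShopId varchar(10) = 'A123';  -- type and value assumed for illustration
SET STATISTICS XML ON;                 -- returns the actual execution plan as XML
INSERT INTO [deletedItems] ([id], [shopId])
SELECT [id], [shopId]
FROM [items]
WHERE shopId = @ShopId;
SET STATISTICS XML OFF;
-- In the plan XML, the Output List of the seek/scan on [items] still contains shopId for the first form of the query.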
In the general case this is not a safe transformation and can lead to lost information. For example, if the source contains the rows
+--------+
| ShopId |
+--------+
| A123   |
| a123   |
+--------+
Then under a case-insensitive collation both rows match the predicate and should be inserted, but their values are different.
If one of the following applies
You are using a datatype where this is not possible
You know that this is not an issue in your data - e.g. check constraints ensure all data is stored trimmed and upper case.
You are happy for a canonical representation to be used for all rows if it is an issue.
then it is possible to come up with convoluted scenarios where your manual optimisation makes sense, as below.
CREATE TABLE #T(X INT IDENTITY, Y CHAR(4000));
-- fill the table with 1,000,000 wide rows
INSERT INTO #T
SELECT TOP 1000000 REPLICATE('A',4000)
FROM sys.all_objects o1, sys.all_objects o2
-- first form: the wide Y column is carried through the sort
SELECT X, Y
FROM #T
WHERE Y = REPLICATE('A',4000)
ORDER BY X
-- second form: Y is projected as a constant computed after the sort
SELECT X, REPLICATE('A',4000) AS Y
FROM #T
WHERE Y = REPLICATE('A',4000)
ORDER BY X
The rows going into the sort operator are much bigger in the first case because they include the large string column, and the sort spills to tempdb; query execution takes substantially longer as a result. The memory grant requested for the second query is the same as for the first, because the grant calculation does not take into account that the column is computed after the sort, but there is less data to sort and it does not spill. On versions of SQL Server where adaptive memory grant feedback is available, the excessive grant would be corrected if the query is executed repeatedly.
In most real-world scenarios, however, I doubt the manual optimisation will make any practical difference, so choose whichever version does what you need and reads more clearly, and concentrate optimisation efforts on more promising areas (to me the second one makes it clearer that the same value will be inserted in all rows).

I don't expect any performance differences. The slow part will be finding the correct items by @ShopId and the I/O operations.
What can improve your query performance is an index on the [shopId] column, where [id] is either the primary key (and therefore already part of the index) or an included column.
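For example (a sketch; the index name is mine, and the INCLUDE is only needed if [id] is not already the clustering key):
CREATE NONCLUSTERED INDEX IX_items_shopId
ON [items] (shopId)
INCLUDE ([id]);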

Will it make the statement perform better if I change it to use the
provided parameter directly
It's the same, because your result set contains only a single ShopId value thanks to the WHERE clause.
INSERT INTO [deletedItems] (
[id],
[shopId])
SELECT
[id],
[shopId]
FROM [items]
WHERE shopId = @ShopId -- this condition means shopId has the same value for every returned row

Two important points.
First, the calculation of scalar expressions in the SELECT (generally) has little impact on query performance. The performance is determined by data movement.
So, selecting a "constant" versus selecting a column from a table is immaterial.
Second, if you care about performance, you need to be very careful about query plans. Either force the use of an index or be sure that the query gets recompiled periodically as the data changes in your tables.
In particular, you want to be sure that the query uses an index on items(shopId) if the table spans multiple data pages.
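If the optimizer will not pick such an index on its own, a table hint can force it (a sketch; the index name IX_items_shopId is hypothetical):
SELECT [id], [shopId]
FROM [items] WITH (INDEX(IX_items_shopId))
WHERE shopId = @ShopId;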

Related

Poor performance of SQL query with Table Variable or User Defined Type

I have a SELECT query on a view that contains 500,000+ rows. Let's keep it simple:
SELECT * FROM dbo.Document WHERE MemberID = 578310
The query runs fast, ~0s
Let's rewrite it to work with the set of values, which reflects my needs more:
SELECT * FROM dbo.Document WHERE MemberID IN (578310)
This is just as fast, ~0s
But now the set of IDs needs to be variable; let's define it as:
DECLARE @AuthorizedMembers TABLE
(
MemberID BIGINT NOT NULL PRIMARY KEY, --primary key
UNIQUE NONCLUSTERED (MemberID) -- and index, as if it could help...
);
INSERT INTO @AuthorizedMembers SELECT 578310
The set contains the same single value, but it is a table variable now. The performance of such a query drops to 2s, and in more complicated ones it goes as high as 25s and more, while with a fixed ID it stays around ~0s.
SELECT *
FROM dbo.Document
WHERE MemberID IN (SELECT MemberID FROM @AuthorizedMembers)
performs just as badly as:
SELECT *
FROM dbo.Document
WHERE EXISTS (SELECT MemberID
FROM @AuthorizedMembers AS AM
WHERE AM.MemberID = Document.MemberID)
or as bad as this:
SELECT *
FROM dbo.Document
INNER JOIN @AuthorizedMembers AS AM ON AM.MemberID = Document.MemberID
The performance is same for all the above and always much worse than the one with a fixed value.
Dynamic SQL helps easily: building an nvarchar like (id1,id2,id3) and constructing a fixed query with it keeps my query times at ~0s. But I would like to avoid dynamic SQL as much as possible, and if I do use it, I would like to keep the query string the same regardless of the values (using parameters, which the above method does not allow).
Any ideas how to get the performance of the table variable similar to a fixed array of values or avoid building a different dynamic SQL code for each run?
P.S. I have tried the above with a user-defined type, with the same results.
Edit:
The results with a temporary table, defined as:
CREATE TABLE #AuthorizedMembers
(
MemberID BIGINT NOT NULL PRIMARY KEY
);
INSERT INTO #AuthorizedMembers SELECT 578310
have improved the execution time up to 3 times (13s -> 4s), which is still significantly higher than dynamic SQL at <1s.
Your options:
Use a temporary table instead of a TABLE variable
If you insist on using a TABLE variable, add OPTION(RECOMPILE) at the end of your query
Explanation:
When the compiler compiles your statement, the TABLE variable has no rows in it and therefore doesn't have the proper cardinalities. This results in an inefficient execution plan. OPTION(RECOMPILE) forces the statement to be recompiled when it is run. At that point the TABLE variable has rows in it and the compiler has better cardinalities to produce an execution plan.
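As a sketch of the second option (table and column names taken from the question above):
SELECT *
FROM dbo.Document
WHERE MemberID IN (SELECT MemberID FROM @AuthorizedMembers)
OPTION (RECOMPILE); -- compiled after the table variable is populated, so the cardinality is known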
The general rule of thumb is to use temporary tables when operating on large datasets and table variables for small datasets with frequent updates. Personally I only very rarely use TABLE variables because they generally perform poorly.
I can recommend this answer on the question "What's the difference between temporary tables and table variables in SQL Server?" if you want an in-depth analysis on the differences.

sql select condition performance

I have a table 'Tab' with data such as:
id | value
---------------
1 | Germany
2 | Argentina
3 | Brasil
4 | Holland
Which way of selecting is better for performance?
1. SELECT * FROM Tab WHERE value IN ('Argentina', 'Holland')
or
2. SELECT * FROM Tab WHERE id IN (2, 4)
I suppose the second select would be faster, because int comparison is faster than string comparison. Is that true for MS SQL?
This is a premature optimization. The comparison between integers and strings is generally going to have a minimal impact on query performance. The drivers of query performance are more along the lines of table sizes, query plans, available memory, and competition for resources.
In general, it is a good idea to have indexes on columns used for either comparison. The first column looks like a primary key, so it automatically gets an index. The string column should have an index built on it. In general, indexes built on an integer column will have marginally better performance than indexes built on variable-length string columns. However, this type of performance difference really makes a difference only in environments with very high levels of transactions (think thousands of data modification operations per second).
You should use the logic that best fits the application and worry about other aspects of the code.
To answer the simple question: yes, option 2, SELECT * FROM Tab WHERE id IN (2, 4), would be faster, as you said, because int comparison is faster.
One way to speed it up is to add indexes to your columns to speed up evaluation, filtering, and the final retrieval of results.
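For example (a sketch; the index name is mine and assumes value is the column you filter on most):
CREATE NONCLUSTERED INDEX IX_Tab_value ON Tab (value);
With that in place, WHERE value IN ('Argentina', 'Holland') can seek on the index instead of scanning the table.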
If this table were to grow even more, you should also not SELECT * but SELECT id, value; otherwise you may be pulling back more data than you need.
You can also speed up your queries by adding WITH (NOLOCK), as the speed of your query might be affected by other sessions accessing the tables at the same time. For example: SELECT * FROM Tab WITH (NOLOCK) WHERE id IN (2, 4). Note, though, that NOLOCK is not a turbo button and should only be used in appropriate situations, since it reads uncommitted data.

SQL "WITH" Performance and Temp Table (possible "Query Hint" to simplify)

Given the example queries below (Simplified examples only)
DECLARE @DT int; SET @DT=20110717; -- yes this is an INT
WITH LargeData AS (
SELECT * -- This is a MASSIVE table indexed on dt field
FROM mydata
WHERE dt=@DT
), Ordered AS (
SELECT TOP 10 *
, ROW_NUMBER() OVER (ORDER BY valuefield DESC) AS Rank_Number
FROM LargeData
)
SELECT * FROM Ordered
and ...
DECLARE @DT int; SET @DT=20110717;
BEGIN TRY DROP TABLE #LargeData END TRY BEGIN CATCH END CATCH; -- drop the temp table if it already exists
SELECT * -- This is a MASSIVE table indexed on dt field
INTO #LargeData -- put smaller results into temp
FROM mydata
WHERE dt=@DT;
WITH Ordered AS (
SELECT TOP 10 *
, ROW_NUMBER() OVER (ORDER BY valuefield DESC) AS Rank_Number
FROM #LargeData
)
SELECT * FROM Ordered
Both produce the same results: a limited and ranked list of values based on a field's data.
When these queries get considerably more complicated (many more tables, lots of criteria, multiple levels of "with" table aliases, etc.) the bottom query executes MUCH faster than the top one, sometimes on the order of 20x-100x faster.
The Question is...
Is there some kind of query HINT or other SQL option that would tell SQL Server to perform the same kind of optimization automatically, or another format that would involve a cleaner approach (trying to keep the format as much like query 1 as possible)?
Note that the "ranking" secondary query is just fluff for this example; the actual operations performed don't really matter too much.
This is sort of what I was hoping for (or something similar, but I hope the idea is clear). Remember, the query below does not actually work.
DECLARE @DT int; SET @DT=20110717;
WITH LargeData AS (
SELECT * -- This is a MASSIVE table indexed on dt field
FROM mydata
WHERE dt=@DT
OPTION (USE_TEMP_OR_HARDENED_OR_SOMETHING) -- EXAMPLE ONLY
), Ordered AS (
SELECT TOP 10 *
, ROW_NUMBER() OVER (ORDER BY valuefield DESC) AS Rank_Number
FROM LargeData
)
SELECT * FROM Ordered
EDIT: Important follow up information!
If in your sub query you add
TOP 999999999 -- improves speed dramatically
Your query will behave in a similar fashion to using a temp table as in the previous query. I found the execution times improved in almost exactly the same way, WHICH IS FAR SIMPLER than using a temp table and is basically what I was looking for.
However
TOP 100 PERCENT -- does NOT improve speed
Does NOT perform in the same fashion (you must use the static-number style TOP 999999999).
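For clarity, this is roughly what the first query looks like with that workaround applied (a sketch based on the follow-up above):
DECLARE @DT int; SET @DT=20110717;
WITH LargeData AS (
SELECT TOP 999999999 * -- the TOP is what changes the plan shape
FROM mydata
WHERE dt=@DT
), Ordered AS (
SELECT TOP 10 *
, ROW_NUMBER() OVER (ORDER BY valuefield DESC) AS Rank_Number
FROM LargeData
)
SELECT * FROM Ordered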
Explanation:
From what I can tell from the actual execution plans of the query in both formats (the original one with normal CTEs and the one with each sub query having TOP 999999999):
The normal query joins everything together as if all the tables were in one massive query, which is what is expected. The filtering criteria are applied almost at the join points in the plan, which means many more rows are being evaluated and joined together all at once.
In the version with TOP 999999999, the actual execution plan clearly separates the sub queries from the main query in order to apply the TOP statement's action, thus forcing the creation of an in-memory "bitmap" of the sub query that is then joined to the main query. This appears to do exactly what I wanted, and in fact it may even be more efficient, since servers with large amounts of RAM will be able to do the query execution entirely in memory without any disk IO. In my case we have 280 GB of RAM, well more than could ever really be used.
Not only can you use indexes on temp tables, but they also allow the use of statistics and hints. I can find no reference to being able to use statistics in the documentation on CTEs, and it says specifically that you can't use hints.
When the choice is between temp tables and table variables, temp tables are often the most performant way to go for a large data set, even when you don't use indexes (possibly because statistics are used to develop the plan), and I suspect the implementation of the CTE is more like the table variable than the temp table.
I think the best thing to do, though, is to see how the execution plans differ, to determine whether it is something that can be fixed.
What exactly is your objection to using the temp table when you know it performs better?
The problem is that in the first query the SQL Server query optimizer is able to generate a query plan. In the second query a good query plan can't be generated, because you're inserting the values into a new temporary table first. My guess is there is a full table scan going on somewhere that you're not seeing.
What you may want to do in the second query is insert the values into the #LargeData temporary table like you already do and then create a non-clustered index on the "valuefield" column. This might help to improve your performance.
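Something along these lines (a sketch; the index name is mine):
-- after populating #LargeData
CREATE NONCLUSTERED INDEX IX_LargeData_valuefield
ON #LargeData (valuefield DESC); -- matches the ORDER BY valuefield DESC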
It is quite possible that SQL is optimizing for the wrong value of the parameters.
There are a couple of options
Try using option(RECOMPILE). There is a cost to this as it recompiles the query every time but if different plans are needed it might be worth it.
You could also try using OPTION (OPTIMIZE FOR (@DT = SomeRepresentativeValue)). The problem with this is that you might pick the wrong value.
See I Smell a Parameter! from The SQL Server Query Optimization Team blog

What is the reason not to use select *?

I've seen a number of people claim that you should specifically name each column you want in your select query.
Assuming I'm going to use all of the columns anyway, why would I not use SELECT *?
Even considering the question *SQL query - Select * from view or Select col1, col2, … colN from view*, I don't think this is an exact duplicate as I'm approaching the issue from a slightly different perspective.
One of our principles is to not optimize before it's time. With that in mind, it seems like using SELECT * should be the preferred method until it is proven to be a resource issue or the schema is pretty much set in stone. Which, as we know, won't occur until development is completely done.
That said, is there an overriding issue to not use SELECT *?
The essence of the quote of not prematurely optimizing is to go for simple and straightforward code and then use a profiler to point out the hot spots, which you can then optimize to be efficient.
When you use select * you're making it impossible to profile; therefore you're not writing clear and straightforward code, and you are going against the spirit of the quote. select * is an anti-pattern.
So specifying columns is not a premature optimization. A few things off the top of my head...
If you specify columns in a SQL statement, the SQL execution engine will error if that column is removed from the table and the query is executed.
You can more easily scan code where that column is being used.
You should always write queries to bring back the least amount of information.
As others mention if you use ordinal column access you should never use select *
If your SQL statement joins tables, select * gives you all columns from all tables in the join
The corollary is that using select * ...
The columns used by the application are opaque
DBAs and their query profilers are unable to help with your application's poor performance
The code is more brittle when changes occur
Your database and network are suffering because they are bringing back too much data (I/O)
Database engine optimizations are minimal as you're bringing back all data regardless (logical).
Writing correct SQL is just as easy as writing Select *. So the real lazy person writes proper SQL because they don't want to revisit the code and try to remember what they were doing when they did it. They don't want to explain to the DBA's about every bit of code. They don't want to explain to their clients why the application runs like a dog.
If your code depends on the columns being in a specific order, your code will break when there are changes to the table. Also, you may be fetching too much from the table when you select *, especially if there is a binary field in the table.
Just because you are using all the columns now, it doesn't mean someone else isn't going to add an extra column to the table.
It also adds overhead to the plan execution caching since it has to fetch the meta data about the table to know what columns are in *.
One major reason is that if you ever add/remove columns from your table, any query/procedure that is making a SELECT * call will now be getting more or less columns of data than expected.
In a roundabout way you are breaking the modularity rule about using strict typing wherever possible. Explicit is almost universally better.
Even if you now need every column in the table, more could be added later, which will be pulled down every time you run the query and could hurt performance. It hurts performance because:
you are pulling more data over the wire; and
you might defeat the optimizer's ability to pull the data right out of the index (for queries on columns that are all part of an index) rather than doing a lookup in the table itself; see the sketch below.
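A rough illustration of that last point (a sketch; the Orders table, its columns, and the index name are all hypothetical):
CREATE NONCLUSTERED INDEX IX_Orders_CustomerId
ON Orders (CustomerId)
INCLUDE (OrderDate);
-- this query can be answered from the index alone (no lookup into the table):
SELECT OrderDate FROM Orders WHERE CustomerId = 42;
-- SELECT * needs every column, so each matching row also requires a lookup into the base table:
SELECT * FROM Orders WHERE CustomerId = 42;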
When TO use select *
When you explicitly NEED every column in the table, as opposed to needing every column in the table THAT EXISTED AT THE TIME YOU WROTE THE QUERY. For example, if you were writing a DB management app that needed to display the entire contents of a table (whatever its columns happened to be), you might use that approach.
There are a few reasons:
If the number of columns in a database changes and your application expects there to be a certain number...
If the order of columns in a database changes and your application expects them to be in a certain order...
Memory overhead. 8 unnecessary INTEGER columns would add 32 bytes of wasted memory. That doesn't sound like a lot, but this is for each query and INTEGER is one of the small column types... the extra columns are more likely to be VARCHAR or TEXT columns, which adds up quicker.
Network overhead. Related to memory overhead: if I issue 30,000 queries and have 8 unnecessary INTEGER columns, I've wasted 960kB of bandwidth. VARCHAR and TEXT columns are likely to be considerably larger.
Note: I chose INTEGER in the above example because they have a fixed size of 4 bytes.
If your application gets data with SELECT * and the table structure in the database is changed (say a column is removed), your application will fail in every place that you reference the missing field. If you instead list all the columns in your query, your application will break in the (hopefully) one place where you initially get the data, making the fix easier.
That being said, there are a number of situations in which SELECT * is desirable. One is a situation that I encounter all the time, where I need to replicate an entire table into another database (like SQL Server to DB2, for example). Another is an application written to display tables generically (i.e. without any knowledge of any particular table).
I actually noticed a strange behaviour when I used select * in views in SQL Server 2005.
Run the following query and you will see what I mean.
IF EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[starTest]') AND type in (N'U'))
DROP TABLE [dbo].[starTest]
CREATE TABLE [dbo].[starTest](
[id] [int] IDENTITY(1,1) NOT NULL,
[A] [varchar](50) NULL,
[B] [varchar](50) NULL,
[C] [varchar](50) NULL
) ON [PRIMARY]
GO
insert into dbo.starTest
select 'a1','b1','c1'
union all select 'a2','b2','c2'
union all select 'a3','b3','c3'
go
IF EXISTS (SELECT * FROM sys.views WHERE object_id = OBJECT_ID(N'[dbo].[vStartest]'))
DROP VIEW [dbo].[vStartest]
go
create view dbo.vStartest as
select * from dbo.starTest
go
IF EXISTS (SELECT * FROM sys.views WHERE object_id = OBJECT_ID(N'[dbo].[vExplicittest]'))
DROP VIEW [dbo].[vExplicittest]
go
create view dbo.[vExplicittest] as
select a,b,c from dbo.starTest
go
select a,b,c from dbo.vStartest
select a,b,c from dbo.vExplicitTest
IF EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[starTest]') AND type in (N'U'))
DROP TABLE [dbo].[starTest]
CREATE TABLE [dbo].[starTest](
[id] [int] IDENTITY(1,1) NOT NULL,
[A] [varchar](50) NULL,
[B] [varchar](50) NULL,
[D] [varchar](50) NULL,
[C] [varchar](50) NULL
) ON [PRIMARY]
GO
insert into dbo.starTest
select 'a1','b1','d1','c1'
union all select 'a2','b2','d2','c2'
union all select 'a3','b3','d3','c3'
select a,b,c from dbo.vStartest
select a,b,c from dbo.vExplicittest
Compare the results of the last two select statements.
I believe what you will see is a result of Select * referencing columns by index instead of name.
If you rebuild the view it will work fine again.
EDIT
I have added a separate question, *“select * from table” vs “select colA, colB, etc. from table” interesting behaviour in SQL Server 2005*, to look into that behaviour in more detail.
You might join two tables and use column A from the second table. If you later add column A to the first table (with same name but possibly different meaning) you'll most likely get the values from the first table and not the second one as earlier. That won't happen if you explicitly specify the columns you want to select.
Of course specifying the columns also sometimes causes bugs if you forget to add the new columns to every select clause. If the new column is not needed every time the query is executed, it may take some time before the bug gets noticed.
I understand where you're going regarding premature optimization, but that really only goes to a point. The intent is to avoid unnecessary optimization in the beginning. Are your tables unindexed? Would you use nvarchar(4000) to store a zip code?
As others have pointed out, there are other positives to specifying each column you intend to use in the query (such as maintainability).
When you're specifying columns, you're also tying yourself into a specific set of columns and making yourself less flexible, making Feuerstein roll over in, well, wherever he is. Just a thought.
SELECT * is not always evil. In my opinion, at least. I use it quite often for dynamic queries returning a whole table, plus some computed fields.
For instance, I want to compute geographical geometries from a "normal" table, that is a table without any geometry field, but with fields containing coordinates.
I use postgresql, and its spatial extension postgis. But the principle applies for many other cases.
An example:
a table of places, with coordinates stored in fields labeled x, y, z:
CREATE TABLE places (place_id integer, x numeric(10, 3), y numeric(10, 3), z numeric(10, 3), description varchar);
let's feed it with a few example values:
INSERT INTO places (place_id, x, y, z, description)
VALUES
(1, 2.295, 48.863, 64, 'Paris, Place de l''Étoile'),
(2, 2.945, 48.858, 40, 'Paris, Tour Eiffel'),
(3, 0.373, 43.958, 90, 'Condom, Cathédrale St-Pierre');
I want to be able to map the contents of this table, using some GIS client. The normal way is to add a geometry field to the table, and build the geometry, based on the coordinates.
But I would prefer to get a dynamic query: this way, when I change coordinates (corrections, more accuracy, etc.), the objects mapped actually move, dynamically.
So here is the query with the SELECT *:
CREATE OR REPLACE VIEW places_points AS
SELECT *,
GeomFromewkt('SRID=4326; POINT ('|| x || ' ' || y || ' ' || z || ')')
FROM places;
Refer to postgis, for GeomFromewkt() function use.
Here is the result:
SELECT * FROM places_points;
place_id | x | y | z | description | geomfromewkt
----------+-------+--------+--------+------------------------------+--------------------------------------------------------------------
1 | 2.295 | 48.863 | 64.000 | Paris, Place de l'Étoile | 01010000A0E61000005C8FC2F5285C02405839B4C8766E48400000000000005040
2 | 2.945 | 48.858 | 40.000 | Paris, Tour Eiffel | 01010000A0E61000008FC2F5285C8F0740E7FBA9F1D26D48400000000000004440
3 | 0.373 | 43.958 | 90.000 | Condom, Cathédrale St-Pierre | 01010000A0E6100000AC1C5A643BDFD73FB4C876BE9FFA45400000000000805640
(3 rows)
The rightmost column can now be used by any GIS program to properly map the points.
If, in the future, some fields get added to the table: no worries, I just have to run the same VIEW definition again.
I wish the definition of the VIEW could be kept "as is", with the *, but alas it is not the case: this is how it is internally stored by postgresql:
SELECT places.place_id, places.x, places.y, places.z, places.description, geomfromewkt(((((('SRID=4326; POINT ('::text || places.x) || ' '::text) || places.y) || ' '::text) || places.z) || ')'::text) AS geomfromewkt FROM places;
Even if you use every column but address the result row as an array by numeric index, you will have problems if another column is added later on.
So basically it is a question of maintainability! If you don't use the * selector you will not have to worry about your queries.
Selecting only the columns you need keeps the dataset in memory smaller and therefore keeps your application faster.
Also, a lot of tools (e.g. stored procedures) cache query execution plans too. If you later add or remove a column (particularly easy if you're selecting off a view), the tool will often error when it doesn't get back results that it expects.
It makes your code more ambiguous and more difficult to maintain; because you're adding extra unused data to the domain, and it's not clear which you've intended and which not. (It also suggests that you might not know, or care.)
To answer your question directly: do not use "SELECT *" when it makes your code more fragile to changes in the underlying tables. Your code should break only when a change is made to the table that directly affects the requirements of your program.
Your application should take advantage of the abstraction layer that Relational access provides.
I don't use SELECT * simply because it is nice to see and know what fields I am retrieving.
It is generally bad to use 'select *' inside views, because you will be forced to recompile the view in the event of a table column change. If you change the underlying table columns of a view, you will get errors for non-existent columns until you go back and recompile.
It's ok when you're doing exists(select * ...) since it never gets expanded. Otherwise it's really only useful when exploring tables with temporary select statements, or if you have a CTE defined above and you want every column without typing them all out again.
Just to add one thing that no one else has mentioned: select * returns all the columns, and someone may add a column later that you don't necessarily want the users to be able to see, such as who last updated the data, a timestamp, or notes that only managers should see, etc.
Further, when adding a column, the impact on existing code should be reviewed and considered to see if changes are needed based on what information is stored in the column. By using select *, that review will often be skipped because the developer will assume that nothing will break. And in fact nothing may explicitly appear to break but queries may now start returning the wrong thing. Just because nothing explicitly breaks, doesn't mean that there should not have been changes to the queries.
because "select * " will waste memory when you don't need all the fields.But for sql server, their performence are the same.

Slow distinct query in SQL Server over large dataset

We're using SQL Server 2005 to track a fair amount of constantly incoming data (5-15 updates per second). We noticed after it has been in production for a couple months that one of the tables has started to take an obscene amount of time to query.
The table has 3 columns:
id -- autonumber (clustered)
typeUUID -- GUID generated before the insert happens; used to group the types together
typeName -- The type name (duh...)
One of the queries we run is a distinct on the typeName field:
SELECT DISTINCT [typeName] FROM [types] WITH (nolock);
The typeName field has a non-clustered, non-unique ascending index on it. The table contains approximately 200M records at the moment. When we run this query, it takes 5m 58s to return! Perhaps we're not understanding how the indexes work... but I didn't think we misunderstood them that much.
To test this a little further, we ran the following query:
SELECT DISTINCT [typeName] FROM (SELECT TOP 1000000 [typeName] FROM [types] WITH (nolock)) AS [subtbl]
This query returns in about 10 seconds, as I would expect; it's scanning the table.
Is there something we're missing here? Why does the first query take so long?
Edit: Ah, my apologies, the first query returns 76 records, thank you ninesided.
Follow up: Thank you all for your answers, it makes more sense to me now (I don't know why it didn't before...). Without an index it's doing a table scan across 200M rows; with an index it's doing an index scan across 200M rows...
SQL Server does prefer the index, and it does give a little bit of a performance boost, but nothing to be excited about. Rebuilding the index did take the query time down to just over 3m instead of 6m, an improvement, but not enough. I'm just going to recommend to my boss that we normalize the table structure.
Once again, thank you all for your help!!
You do misunderstand the index. Even if it did use the index, it would still do an index scan across 200M entries. This is going to take a long time, plus the time it takes to do the DISTINCT (which causes a sort), so it's a bad thing to run. Seeing a DISTINCT in a query always raises a red flag and causes me to double-check the query. In this case, perhaps you have a normalization issue?
There is an issue with the SQL Server optimizer when using the DISTINCT keyword. The solution was to force it to keep the same query plan by breaking out the distinct query separately.
So we took queries such as:
SELECT DISTINCT [typeName] FROM [types] WITH (nolock);
and break it up into the following:
SELECT typeName INTO #tempTable1 FROM types WITH (NOLOCK)
SELECT DISTINCT typeName FROM #tempTable1
Another way to get around it is to use a GROUP BY, which gets a different optimization plan.
I doubt SQL Server will even try to use the index; it'd have to do practically the same amount of work (given the narrow table), reading all 200M rows regardless of whether it looks at the table or the index. If the index on typeName were clustered, it might reduce the time taken, as it shouldn't need to sort before grouping.
If the cardinality of your types is low, how about maintaining a summary table which holds the list of distinct type values? A trigger on insert/update of the main table would do a check on the summary table and insert a new record when a new type is found.
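A rough sketch of that idea (the summary table, its column type, and the trigger name are my assumptions):
CREATE TABLE typeSummary (typeName varchar(100) NOT NULL PRIMARY KEY);
GO
CREATE TRIGGER trg_types_summary ON types
AFTER INSERT, UPDATE
AS
BEGIN
    -- add any type names from the modified rows that are not yet in the summary
    INSERT INTO typeSummary (typeName)
    SELECT DISTINCT i.typeName
    FROM inserted AS i
    WHERE NOT EXISTS (SELECT 1 FROM typeSummary AS s WHERE s.typeName = i.typeName);
END;
SELECT typeName FROM typeSummary then replaces the slow DISTINCT over the 200M-row table.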
As others have already pointed out - when you do a SELECT DISTINCT (typename) over your table, you'll end up with a full table scan no matter what.
So it's really a matter of limiting the number of rows that need to be scanned.
The question is: what do you need your DISTINCT typenames for? And how many of your 200M rows are distinct? Do you have only a handful (a few hundred at most) distinct typenames??
If so - you could have a separate table DISTINCT_TYPENAMES or something and fill those initially by doing a full table scan, and then on inserting new rows to the main table, just always check whether their typename is already in DISTINCT_TYPENAMES, and if not, add it.
That way, you'd have a separate, small table with just the distinct TypeName entries, which would be lightning fast to query and/or to display.
Marc
A looping approach should use multiple seeks (but loses some parallelism). It might be worth a try for cases with relatively few distinct values compared to the total number of rows (low cardinality).
The idea was from this question:
select typeName into #Result from Types where 1=0; -- empty table with the right schema
declare @t varchar(100) = (select min(typeName) from Types);
while @t is not null
begin
insert into #Result values (@t); -- record the current distinct value
-- seek to the next distinct value above the current one
set @t = (select top 1 typeName from Types where typeName > @t order by typeName);
end
select * from #Result;
And it looks like there are also some other methods (notably the recursive CTE by Paul White):
different-ways-to-find-distinct-values-faster-methods
sqlservercentral Topic873124-338-5
My first thought is statistics. To find last updated:
SELECT
name AS index_name,
STATS_DATE(object_id, index_id) AS statistics_update_date
FROM
sys.indexes
WHERE
object_id = OBJECT_ID('MyTable');
Edit: Stats are updated when indexes are rebuilt, which I see is not being done here.
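If the statistics are stale, refreshing them is straightforward (a sketch; the table name follows the question):
ALTER INDEX ALL ON dbo.types REBUILD; -- rebuilds the indexes and refreshes their statistics
-- or, without rebuilding:
UPDATE STATISTICS dbo.types;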
My second thought: is the index still there? The TOP query should still use an index.
I've just tested on one of my tables with 57 million rows and both use the index.
An indexed view can make this faster.
create view alltypes
with schemabinding as
select typename, count_big(*) as kount
from dbo.types
group by typename
go
create unique clustered index idx
on alltypes (typename)
The work to keep the view up to date on each change to the base table should be moderate (depending on your application, of course -- my point is that it doesn't have to scan the whole table each time or do anything insanely expensive like that.)
Alternatively you could make a small table holding all values:
select distinct typename
into alltypes
from types
alter table alltypes
add primary key (typename)
alter table types add foreign key (typename) references alltypes
The foreign key will make sure that all values used appear in the parent alltypes table. The trouble is in ensuring that alltypes does not contain values not used in the child types table.
I should try something like this:
SELECT typeName FROM [types] WITH (nolock)
group by typeName;
And like others, I would say you need to normalize that column.
An index helps you quickly find a row. But you're asking the database to list all unique types for the entire table. An index can't help with that.
You could run a nightly job which runs the query and stores it in a different table. If you require up-to-date data, you could store the last ID included in the nightly scan, and combine the results:
select type
from nightlyscan
union
select distinct type
from verybigtable
where rowid > lastscannedid
Another option is to normalize the big table into two tables:
table1: id, guid, typeid
type table: typeid, typename
This would be very beneficial if the number of types was relatively small.
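A minimal sketch of that layout (names and column types are illustrative, based on the columns described in the question):
CREATE TABLE typeNames (
typeId int IDENTITY PRIMARY KEY,
typeName varchar(100) NOT NULL UNIQUE
);
CREATE TABLE records (
id int IDENTITY PRIMARY KEY, -- autonumber (clustered)
typeUUID uniqueidentifier NOT NULL,
typeId int NOT NULL REFERENCES typeNames (typeId)
);
-- the distinct list then becomes a scan of the small typeNames table:
SELECT typeName FROM typeNames;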
I could be missing something, but would it be more efficient to accept an overhead on load to create a view with the distinct values and query that instead?
This would give almost instant responses to the select if the result set is significantly smaller, with the overhead of populating it on each write; though given the nature of the view, that might be trivial in itself.
It does raise the question of how many writes there are compared to how often you want the distinct list, and how important speed is when you do.